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Abstract 

We propose a generalisation of the existing maximum entropy models used for 
spike trains statistics analysis, based on the thermodynamic formalism from er- 
godic theory, and allowing one to take into account memory effects in dynamics. 
We propose a spectral method which provides directly the "free-energy" density 
and the KuUback-Leibler divergence between the empirical statistics and the statis- 
tical model. This method does not assume a specific Gibbs potential form. It does 
not require the assumption of detailed balance and offers a control of finite-size 
sampling effects, inherent to empirical statistics, by using large deviations results. 
A numerical validation of the method is proposed and the perspectives regarding 
spike-train code analysis are also discussed. 
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1 Introduction 

Processing and encoding of information in neuronal dynamics is a very active research 
field II59I . although still much of the role of neural assemblies and their internal interac- 
tions remains unknown f55l. The simultaneously recording of the activity of groups of 
neurons (up to several hundreds) over a dense configuration, supplies a critical database 
to unravel the role of specific neural assembUes. In complement of descriptive statistics 
(e.g. by means of cross-correlograms or joint peri-stimulus time histograms), some- 
how difficult to interpret for a large number of units (review in [8^ 37|), is the specific 
analysis of multi-units spike-patterns, as found e.g. in |[T]. This approach develops 
algorithms to detect common patterns in a data block, as well as performing combina- 
torial analysis to compute the expected probability of different kind of patterns. The 
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main difficulty with such type of approaches is that they rely on a largely controversial 
assumption, Poissonian statistics (see Il57ll58l l66 l). which moreover, is a minimal sta- 
tistical model largely depending on the belief that firing rates are essentially the main 
characteristic of spike trains. 

A different approach has been proposed in |66|. They have shown that a model 
taking into account pairwise synchronizations between neurons in a small assembly 
(10-40 retinal ganglion cells) describes most (90%) of the correlation structure and of 
the mutual information of the block activity, and performs much better than a non- 
homogeneous Poissonian model. Analogous results were presented the same year in 
||73l . The model used by both teams is based on a probability distribution known as 
the Gibbs distribution of the Ising model which comes from statistical physics. The 
parameters of this distribution relating, in neural data analysis, to the firing rate of neu- 
rons and to their probability of pairwise synchronisation have to be determined from 
empirical data. Note that this approach has been previously presented in neuroscience, 
but in a slightly different and more general fashion, by ||45l [39l |46l (it was referred as 
"log-linear models"). The use of Ising model in neural decoding (especially of visual 
stimuli) has been largely exploited by several other authors lfT9ll53ll72l|78J . In partic- 
ular, it is believed by some of them [19] that the pairwise coupling terms inferred from 
simultaneous spikes corresponds, in the model, to effective couplings between gan- 
glion cells. In this spirit, computing the parameters of the Ising model would provide 
an indirect access to ganglion cells connections. In addition, an increasing number of 
different theoretical and numerical developments of this idea have recently appeared. 
In particular, in |80|, the authors propose a modified learning scheme and thanks to 
concepts taken from physics, such as heat capacity, explore new insights like the dis- 
tribution of the underlying density of states; additionally in ||6Tll60l authors study and 
compare several approximate, but faster, estimation methods for learning the couplings 
and apply them on experimental and synthetic data drawing several results for this type 
of modeling. 

However, it might be questionable whether more general form of Gibbs distribu- 
tions (e.g. involving «-uplets of neurons) could improve the estimation and account for 
deviations to Ising-model ( 1721 [SO'D and provide a better understanding of the neural 
code from the point of view of the maximal entropy principle |34J. As a matter of fact, 
back to 1995, B6l already considered multi-unit synchronizations and proposed sev- 
eral tests to understand the statistical significance of those synchronizations and the real 
meaning of their corresponding value in the energy expansion. A few years later, B31 
generalized this approach to arbitrary spatio-temporal spike patterns and compared this 
method to other existing estimators of high-order correlations or Bayesian approaches. 
They also introduced a method comparison based on the Neyman-Pearson hypoth- 
esis test paradigm. Though the numerical implementation they have used for their 
approach presented strong limitations, they have applied this methods successfully to 
experimental data from multi-units recordings in the pre-frontal cortex, the visual cor- 
tex of behaving monkeys, and the somato-sensory cortex of anesthetized rats. Several 
papers have pointed out the importance of temporal patterns of activity at the network 
level BTl [83l [69] . and recently I.78J have shown the insufficiency of Ising model to 
predict the temporal statistics of the neural activity. As a consequence, a few authors 
(we know only one reference, B4l ) have attempted to define time-dependent Gibbs 
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distributions on the base of a Markovian approach (1-step time pairwise correlations) 
and convincingly showed a clear increase in the accuracy of the spike train statistics 
characterization. Namely, this model produces a lower Jensen-Shannon Divergence, 
when analyzing raster data generated by a Glauber spin-glass model, but also in vivo 
multineuron data from cat parietal cortex in different sleep states. 

To summarize, the main advantages of all these 'Ising-Uke' approaches are: 

• (i) to be based on a widely used principle, the maximal entropy principle ||34l to 
determine statistics from the empirical knowledge of (ad hoc) observables; 

• (ii) to propose statistical models having a close analogy with Gibbs distributions 
of magnetic systems, hence disposing of several deep theoretical results and nu- 
merical methods (Monte-Carlo methods, Mean-Field approximations, series ex- 
pansions), resulting in a fast analysis of experimental data from large number of 
neurons. 

However, as we argue in this paper, this approaches presents also, in its current 
state, fundamental weaknesses: 

• (i) The maximal entropy principle leads, in its classical formulation, to a para- 
metric form, corresponding to chosing a finite set of ad hoc constraints, which 
only provides an approximation of the real statistics, while the distance (say 
measured by the KuUback-Leibler divergence) between the model and the hid- 
den distribution can be quite large [211. Moreover, when considering time depen- 
dent correlations, this procedure leads to Gibbs potential which requires a proper 
renormalisation in order to be related to a Markov chain (see section |2.3.6t . 

• (ii) The Gibbs distributions considered by these approaches, with the naive form 
"je^P^", where Z is a constant (while it depends on boundary terms in the gen- 
eral case) have a limited degree of application; in particular they do not extend 
easily to time dependent sequences with long memory, as spikes train emitted 
from neural networks might well be. Especially, considering already one time 
step Markov processes leads to substantial complications a shown in 14411 . The 
"partition function" is not a constant (see eq. (1) of paper [44|) and needs to be 
approximated (eq. (4) of the same paper) using the (unproved) assumption of de- 
tailed balance, which is moeover a sufficient but non necessary condition for the 
existence of an equilibrium state, and may hardly generalize to more elaborated 
models. 

• (iii) It does not allow to treat in a straightforward way the time-evolution of the 
Gibbs distribution (e.g. induced by mechanisms such as synaptic plasticity). 

However, more general forms of Gibbs distributions have been introduced since 
long Il74ll64l l5l. in a theory called "thermodynamic formalism" introduced in the realm 
of dynamical systems and ergodic theory, allowing to treat infinite time sequences of 
processes with long (and even infinite P3l ) memory. In this paper, we use the thermo- 
dynamic formalism to propose a generalisation of the existing models used for spike 
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trains statistics analysis which resuhs in a more powerful framework that overcomes 
some of the weaknesses mentioned above. Our results are grounded on well estab- 
lished theoretical basements (see e.g. E38l) completed by recent results of the authors 
dealing with collective spike trains statistics in neural networks models IfTSlfTTI . The 
theoretical framework of our approach is presented in the section |2] We propose a 
global approach to spike train analysis, going beyond the usual approaches essentially 
because it allows us to take into account (long term) memory effects in dynamics (sec- 
tions I2.ll2.2l l. As a matter of fact we deal with models considering spatio-temporal 
and time-causal structure of spike trains emitted by neural networks together with the 
fact that some spike sequences (or "words") might be forbidden by dynamics, intro- 
ducing the notion of grammar. We propose a spectral method which provides directly 
the "free energy density" and the Kullback-Leibler divergence between the empirical 
statistics and the statistical model (section l23T i. This method does not assume a specific 
potential form and allows us to handle correctly non-normalized potentials. It does not 
require the assumption of detailed balance (necessary to apply Markov Chain Monte- 
Carlo (MCMC) methods) and offers a control of finite-size sampling effects, inherent 
to empirical statistics, by using large deviations results (Section l274] i. The price to pay 
is to introduce a somewhat heavy, but necessary, mathematical formalism. In several 
places we make connections with existing methods to clarify these concepts. 

These theoretical basements allows us to propose, in section[3] a numerical method 
to parametrically estimate, and possibly compare, models for the statistics of simulated 
multi-cell-spike trains. Our method is not limited to firing rates models, pairwise syn- 
chronizations as ll66l l73l l72l or 1-step time pairwise correlations models as |44|, but 
deals with general form of Gibbs distributions, with parametric potentials correspond- 
ing to a spike n-uplets expansion, with multi-units and multi-times terms. The method 
is exact (in the sense that is does not involve heuristic minimization techniques). More- 
over, we perform fast and reliable estimate of quantities such as the KuUback- Leibler 
divergence allowing a comparison between different models, as well as the computa- 
tion of standard statistical indicators, and a further analysis about convergence rate of 
the empirical estimation and large deviations. 

In section ID we perform a large battery of tests allowing us to experimentally val- 
idate the method. First, we analyse the numerical precision of parameter estimation. 
Second, we generate synthetic data with a given statistics, and compare the estimation 
obtained using these data for several models. Moreover, we simulate a neural network 
and propose the estimation of the underlying Gibbs distribution parameters whose ana- 
lytic form is known ifTTIl . We also perform the estimation for several models using data 
obtained from a simulated neural network with stationary dynamics after Spike-Time 
dependent synaptic plasticity. Finally, we show results on the parameters estimation 
from synthetic data generated by a non-stationary statistical model. 
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2 Spike trains statistics from a theoretical perspective. 



2.1 General context 

2.1.1 Neural network dynamics. 

We consider the evolution of a network of neurons, described by a dynamical model, 
that is, either a deterministic dynamical system or a stochastic dynamical system (usu- 
ally governed by both a deterministic evolution map and additive noise). We assume 
that there is a minimal time scale 5, set to 5 = 1 without loss of generality, at which 
dynamics can be time-discretized. Typically, this can be the minimal resolution of the 
spike time, constrained by biophysics and by measurements methods (see f9l for a dis- 
cussion on time discretisation in spiking neural networks). The typical neuron models 
we think of are punctual conductance based generalized Integrate-and-Fire (IF) models 
with exponential synapses (gIF) IJ3I . Actually, the formalism developed here has been 
rigorously funded in Lil] for Leaky-Integrate-and-Fire (LIF) models with noise. We 
further assume the network parameters (synaptic weights, currents, etc..) to be fixed 
in this context (see fT3l for a discussion). This means that we assume observing a 
period of time where the system parameters are essentially constants. In other words, 
we focus here on stationary dynamics. This restriction is further discussed in section 

1333] 

We are interested in situations where neurons dynamics, and especially spikes oc- 
currences, do not show any regularity or exact reproducibility and require a statistical 
treatment. This is obviously the case for stochastic evolutions but this also happens in 
the deterministic case, whenever dynamics exhibits initial conditions sensitivity. This 
leads us to the choice of the statistical formalism proposed here, called the "thermody- 
namic formalisrrQ " (see fT2l| for an extended discussion). 

2.1.2 Dynamics and raster plots. 

Each neuron of index / = . . . — 1 is characterized by its state, Xi, which belongs to 
some (bounded) set y <E . M is the number of variables characterizing the state of 
one neuron (we assume that all neurons are described by the same number of variables). 
A typical example is M = 1 where Xj = V, is the membrane potential of neuron / and 
= [Vmin , Vmax] but the present formalism affords extensions to such additional char- 
acteristics as activation variables (e.g. for the Hodgkin-Huxley model [3l\M — 4). The 
variable X = [X,]^^' represents the state of a network of neurons. Without loss of 
generality, we assume that all neurons have the same properties so that X E ^ = , 
where ^ is the phase space where dynamics occurs. The evolution of the network 

over an infinite time is characterized by a trajectory X =^ {^(Olrtlloo- 

One can associate to each neuron / a variable fi), (f ) = 1 if neuron ; fires between 

[f ,f + I [ and (B,(f) = otherwise. A "spiking pattern" is a vector <»(? ) [Wi(0];!=o ^ ^ 

'This terminology has been introduced by Sinai |74|, Ruelle |64| and Bowen |5 | because of its analogy 
with statistical physics. But it does not relies on the principles of thermodynamics. Especially, the maxi- 
mization of statistical entropy, discussed below, does not requires the invocation of the second principle of 
thermodynamics . 
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which tells us which neurons are firing at time t. In this setting, a "raster plot" is a 

sequence © =^ {ft'(0}/t°loo' of spiking patterns. Finally a spike block is a finite set of 
spiking pattern, written: 

where spike times have been prescribed between the times fi to ti- 

To each trajectory X = {^(OlrtTloo is associated a raster plot CO = {co{t)}^J^_^. 
This is the sequence of spiking patterns displayed by the neural network when it fol- 
lows the trajectory X. We write X — > (B. On the other way round, we say that an 
infinite sequence CO — {co{t)}^J^_^ is an admissible raster plot if dynamics allows a 
trajectory X such that X CO. We call E the set of admissible raster plots. The dynam- 
ics of the neurons state induces therefore a dynamics on the set of admissible raster 
plots, represented by the left shift, a, such that aco = co' <^ co'{t) = co[t + l),Vf > 
. Thus, in some sense, raster plots provide a code for the trajectories X. Note that the 
correspondence may not be one-to-one 1 10 |. 

Though dynamics produces many possible raster plots, it is important to remark 
that it is not able to produce any possible sequence of spiking patterns. This depends 
on the system properties (e.g., refractoriness forbids raster plots with spike interval 
below \ms) and parameters (e.g., after synaptic weight adaptation the dynamics often 
appears more constrained). For example, inhibition may prevent a neuron to fire when- 
ever a group of pre-synaptic neurons has fired before. There are therefore allowed and 
forbidden sequences, constrained by dynamics. This corresponds to the following cru- 
cial property, often neglected in entropy estimations of spike trains |59|. The set of 
admissible raster plot E is not the set of all possible raster plots. Indeed, considering 
spike blocks of size n there are 2^" possible spike blocks but quite a bit less admissible 
raster plots (the exponential rate of growths in the number of admissible raster plots is 
given by the topological entropy which is an upper bound for the Kolmogorov-Sinai 
entropy defined in eq. (O, footnote|6]l. 



2.1.3 Transition probabilities. 

Typically, the network dynamics and the related spikes fluctuate in an unpredictable 
manner. The spike response itself is not sought as a deterministic response in this 
context, but as a conditional probability I.59J. "Reading out the code" consists of 
inferring such probability. Especially, the probability that a neuron emits a spike at 
some time t depends on the history of the neural network. However, it is impos- 
sible to know explicitely its form in the general case since it depends on the past 
evolution of all variables determining the neural network state X. A possible sim- 
plification is to consider that this probability depends only on the spikes emitted in 
the past by the network. In this way, we are seeking a family of transition prob- 



abilities of the form P 



Co{t) I [coJ'^J from which all spike trains statistical proper- 



ties can be deduced. These transition probabilities are called conditional intensity in 
1351121 [IS |36l|82]|5l] [81] 113 and they are essential to determine completely the spike 
trains statistics. The price to pay is that we have to consider processes with memory 
(which is not so shocking when dealing with neural networks). 
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These transition probabilities are unknown for most models but an explicit compu- 
tation can be rigorously achieved in the case of a discrete time Leaky Integrate-and-Fire 
(LIF) neural networks with noise, in the stationary case (e.g. time independent stim- 
ulus) (see eq. ( |40] i and fTTI ). Stationarity means here that the transition probability 
does not depend explicitely on f so that one can focus on transition probabilities of 

the form P G)(0) | [ft)]li and infer the probability of any spike block by the classi- 
cal Chapman-Kolmogorov equation ||27ll . To our knowledge this is the only example 
where the complete spike trains statistics can be rigorously and analytically computed. 
Note that the transition probability depends on a unbounded past in the LIF model. 
Indeed, the state of a neuron is reset whenever it fires, so the probability of a given 
spiking pattern at time depends on the past up to a time when each neuron has fired 
at least once. However, this time cannot be bounded (though the probability that it is 
larger than some t decays exponentially fast with t) ifTTIl . 



2.1.4 Gibbs distribution. 

As far as the present paper is concerned, the main result in 1 1 1 1 states that some neural 
networks models do have Gibbs distributions, though of a quite more complex form 
than currently used in the literature. More precisely it is rigorously proved in ifTTl that 
in discrete-time LIF modelfl with noise the statistics of spike trains is characterized 
by a Gibbs distribution which is also an equilibrium state, where the potential can be 
explicitely computed, but has infinite range. 

Let us be more explicit. Since we are using the terms "Gibbs distribution" and 
"equilibrium state" in a more general sense than the definition used in the neuroscience 
community for spike train statistics analysis, we give here the definition of these two 
terms. In several places in the paper we show the link between this formalism and 
the usual form, and explain why we need to use the present formalism for spike train 
analysis. The main difference is that we consider probability distributions on a set of 
spatio-temporal sequences where the "space" is the network, and where time is infinite 
so that the spike train probability distributions is defined on infinite time sequences 
0. This is the natural context when considering transition probabilities as introduced 
in the previous section. The price to pay is a more complex formulation than the 
classical 2exp(— j3//), but the reward is having a formalism allowing us to handle 
spike trains statistics including memory terms, and an explicit way to compute the free 
energy density and the Kullback-Leibler divergence between the empirical statistics 
and a statistical model, as developped in the rest of the paper 

A probability distribution pL^ on the set of infinite spike sequences L (raster plots) 
is a Gibbs distribution if there exists a functioio ^ : E — ^ R, called a potential, such 

^Without restriction on the synaptic weiglits except tliat tliey are finite. 

''This coiTesponds to the "thermodynamic limit" in statistical physics but in our case thermodynamic limit 
means "time tends to infinity" instead of "dimension of the system tends to infinity". As a matter of fact the 
number of neurons, N. is fixed and finite in the whole paper 

''Some regularity conditions, associated with a sufficiently fast decay of the potential at infinity, are also 
required, that we do not state here L38i . 
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that the probability of a spike block [o], , for any <t< +°°, and n > 0, obeys: 

([0)];+") 

where P(0),ci,C2 are some constants with < ci < 1 < C2- Recall that a is the shift 
on rasters defined in section 12.1.21 Basically, this expresses that, as n becomes large, 

M,([«r)behaveaiike!l||^. 

An equilibrium state is a probability distribution ji^ which satisfies the following 
variational principle: 

=/z(Ai^)+M^((/))= sup hin)+^{(^), (2) 

fiG;n(™')(I) 

where m^'"^'^ (^) is the set of invariant probability measures on E, /i (/x ) is the entrop}@ of 

def 

the probability /i, and /x(^) = / (j)dji is the average of with respect to the probability 
jj.. Note that the notion of Gibbs distribution and equilibrium state are not equivalent 
in general |38 1, but in the present context, they are lilli . 

The term P{^), called the topological pressure in this context, is the formal analog 
of a thermodynamic potential (free energy density). It is a generating function for the 
cumulants of (j) (see section 1223] for explicit examples). 



2,1.5 Gibbs potential. 

In the case of discrete-time LIF models the potential is the log of the probability 



transition P 



(o(t)\\(oY-l 



(TT]. We believe that this statement extends to more gen- 
eral situations: if a spike train is characterized by a Gibbs distribution then a natural 
candidate for the Gibbs potential is the log of the conditional intensity. Let us insist 
on this result. Beyond the mathematical intrincacies grounding this statement, this 
choice is natural because it provides a (time) causal potential with memory. As a con- 
sequence, the statistics of spikes at a given time are causaly related to the past spikes. 
This corresponds to potential having a range that can be large. A potential has range 
(K-i))- In terms of the transition probability, this corresponds 
to a system with a memory depth R — I (the probability that a neuron spike at time f 



^In the sense of {T).Thus,"behaves like" does not mean "is equal to". 
* The Kolmogorov-Sinai entropy or entropy rate of a probability ^ is: 

h[^]^ lim 'LJ!^^ (3) 

where 

AW[Ai] = - E M(Wr')l°gA'(M"')' 

E'"' being the set of admissible sequences of length n. This quantity provides the exponential rate of growth 
of admissible blocks having a positive probability under 11, as n growths. It is positive for chaotic system 
and it is zero for periodic systems. 
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depends on the spikes emitted by the network, up to time f — (/? — 1 ) back in the pasfl). 
Unfortunately even if the simplest known example of neural network model, the LIF, 
the range is (mathematically) infinit^l- Is the situation simpler for more complex neural 
networks models, for real neural networks ? Fortunately, finite range approximations 
can be proposed, with a good control on the degree of approximation, as we now de- 
velop. 



2.2 Markov approximations. 

In the sequel, we make the assumption that the spike trains statistics of the system 
that we are observing is described by a Gibbs distribution whose potential has to be 
determined from empirical data. 



2.2.1 Range-7J potential. 

It is always possible to propose Markov approximations of ^, even in the case where 
the Gibbs potential depends on spike sequences with unbounded length. This is the 
strategy that we now develop. We approximate the exact transition probability by a 



(0(0)1 [CO 



1 

-(R-l) 



In this 



transition probability with finite memory of depth R—1,P 

context, as shown in |Ti |, the exact Gibbs potential can be approximatec0by a range-R 
potential with a parametric form; 

Z=l(ii A ),..., (</A)e^(W,«), 

This form is nothing but a Taylor expansion of log(f fi)(0) | ), where one 

collects all terms of form cof^' (n,, ) . . . O,^' (n;, ), for integer k\^... ki's, using that cof («) = 
(Oi{n), for any A: > and any /,n. Here ^{N,R) is the set of non repeated pairs of 
integers (/,n) with / e {0, . . . ,A^ — 1} and n e {0 . . . ,R — 1}. 

Such form of potential is a linear combination of monomials. An order-n mono- 
mial is a product O)/, (fi) . . . COi„{t„), where < /i <i2 <•••<'«< A' — 1, < fi <t2< 
■ ■ ■ < t„ < oo and such that there is no repeated pair (ikJii), k — I ...n. The monomial 
COij (f 1 ) . . . cOi^, {t„) takes values in {0, 1 } and is 1 if and only if each neuron // fires at 
time ti, I — I ...n. On phenomenological grounds the monomial a),| (fi ) . . . fi),„ cor- 
responds to a spike n-uplet (/i .fi), . . . , (/„,f„) (neuron ii fires at time ti, neuron ij at 
time f2, etc ...). 



2.2.2 Further approximations. 

The potential Q remains quite cumbersome since the number of terms in (|6]l explodes 
combinatorially as N,R growth. Equivalently, in terms of the classical Jaynes approach 

'Hence range 1 or equivalently memory depth means time independent events. 
'^Though the variation of <f> decays exponentially fast ensuring the existence of a thermodynamic limit, 
'in the case of LIF models the Kullback-Leibler divergence between the exact Gibbs distribution and its 
approximation by the potential (5) decays exponentially fast with R. 
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where the Gibbs distribution is obtained via the maximisation of statistical entropy un- 
der constraints (see section l2.4.3l ). one has to fix a number of constraints that growths 
exponentially fast with N,R. As a consequence, one is typically lead to consider para- 
metric forms where monomials have been removed (or, sometimes, added) in the ex- 
pansion. This constitutes a coarser approximation to the exact potential, but more 
tractable from the numerical or empirical point of view. To alleviate notations we 
write, in the rest of paper, the parametric potential in the form: 

L 

V/=^A,0,, (6) 

1=1 

where (pi's are monomials. The choice of the parametric form defines what we call 
a "statistical model", namely a Gibbs distribution, denoted jXif, in the sequel, for the 
potential (|6]l. The question is "how far is this distribution from the true statistics" ? 

2.2.3 Examples of range-/? potentials 

Bernoulli potentials The easiest example of potential are range- 1 potentials (memo- 
ryless) where \j/{co) ~ L^^' ^i<^iiO)- The corresponding Gibbs distribution provides a 
statistical model where neurons are independent. 

"Ising" like potentials. This type of potential has been introduced by Schneidman and 
collaborators in ll67l . It reads, in our notations, 

N-l N-li-l 

W{co) = ^'^^'(0) + L L ^.M0)cOj{0). (7) 

!=0 1=0 j=0 

The corresponding Gibbs distribution provides a statistical model where synchronous 
pairwise correlations between neurons are taken into account, but neither higher order 
spatial correlations nor other time correlations are taken into account. As a conse- 
quence, the corresponding "Markov chain" is memoryless. 

Pairwise Time-Dependent-fc potentials with rates (RPTD-^). 

An easy generahzation of (|7]l is: 

Wico) = XMQ)+Y Y Y ^iJrC0i{Q)C0j{T), (8) 

;=0 ;=0 j=QT=-k 

called Pairwise Time-Dependent k (RPTD-k) with Rates potentials in the sequel. 

Pairwise Time-Dependent k (PTD-fc) potentials. 

A variation of dH) is to avoid the explicit constraints associated to firing rates : 

LL t ^Ur(0i{0)(0j{t), (9) 

1=0 j=OT=-k 

called Pairwise Time-Dependent k (PTD-k) potentials in the sequel. 
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2.2.4 Encoding spike blocks 

To each spike block of length R, [o)]^^^ k £7L, one can associate an integer: 

k+R-\N-l 

E2'+^'a),(0- (10) 

t=k i=0 

One has 2^* such possible blocks (though some of them can be forbidden by dynam- 
ics). 

We use the following notation: 

k+R N-l 

aw,= Y. E2'+*«Kf), (11) 

t=k+l i=0 

SO that, represents the block [o)]^^^ ' and aw/, = wi^^i represents the block [oj^ijif . 
In this setting a range-^ potential is therefore a vector in the space =^ IR^ with 

def 

components Yw — '/(®)- This amounts to recoding spiking sequences as sequences 
of spike blocks of length R, associated with words w^, taking into account the memory 
depth of the Markov chain. 



2.3 Determining the statistical properties of a Gibbs distribution. 

We now introduce the thermodynamic formalism allowing us to compute numerically 
the main statistical properties of a Gibbs distribution. This approach is different from a 
classical approach in statistical physics where one tries to compute the partition func- 
tion. The present approach gives directly the topological pressure (corresponding to the 
free energy density in the thermodynamic limit) from which the statistical properties 
can be inferred. 



2.3.1 The Ruelle-Perron-Frobenius operator. 

Once the parametric form of the potential is given, the statistical properties of the Gibbs 
distribution are obtained by the Ruelle-Perron-Frobenius (RPF) operator introduced by 
Ruelle in ||62'| . In the present case this is a positive 2^^^ x 2^^ matrix, , with 

entries 

iw',w(r)=e'^'"G^',w. , (12) 

(while it acts on functional spaces in the infinite range case). 

The matrix G is called the grammar. It encodes in its definition the essential fact 
that the underlying dynamics is not able to produce all possible raster plots: 

f 1, if the transition w' wis admissible; 
"'■"^"1 0, otherwise. 

Since we are considering blocks of the forrrFl 



'"since dynamics is assumed stationnary the result actually does not depend on k. 
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1) and w ~ [cl^+f = 0)(k+ I) ... 0}{k~R), the transition w' — ;> w is legal if^w'' and w 
have the spiking patterns (a{k + \ ) . . . (i){k + R — \ ) m common. Thus, while there are 
2^^^ blocks for a network of neurons, the matrix G has at most 2^ non zero entries 
on each line. As a consequence is sparse. 

Note also that all non zeroes entries L„i „{\j/) on a given line are equal. This de- 
generacy comes from our choice to represent as a vector in which is the easiest 
for numerical purposes. This has consequences discussed in the section 17. 3. 61 

2.3.2 The Ruelle-Perron-Frobenius theorem 

In the present paper we make the assumption that the underlying (and hidden) dynam- 
ics is such that the L{\j/) matrix is primitive, i.e. 3n > 0, s.t. Vw,w' L"j ^,,(1^) > 0. 
This assumption holds for Integrate-and-Fire models with noise and is likely to hold 
for more general neural networks models where noise renders dynamics ergodic and 
mixing ifTTl . Note, on the opposite, that if this assumption is not fulfilled there are little 
chances to characterize spike trains statistics with a (unique) Gibbs distribution. 
Then, L{\j/) obeys the Perron-Frobenius theoreno 

Theorem 1 L{\if) has a unique maximal and strictly positive eigenvalue s{\if) — e^'"''' 
associated with a right eigenvector b{\jf)) and a left eigenvector {b{\jf), with positive 
and bounded entries, such that L(^)b{'^/)) — s{\jf)b{\lf)), {b{\lf)L{\if) = s{\if){b{\lf). 
Those vectors can be chosen such that {b{\j/).b{\j/)) — 1 where . is the scalar product 
in The remaining part of the spectrum is located in a disk in the complex plane, of 
radius strictly lower than s(\j/). As a consequence, for all v in J^, 

-±^L"ixif)y^b{w)){b{w).y. (14) 

as n ^ °o. 

The Gibbs-probability of a spike block w of length R is 

H^{w)=b„{\i/)){b„{\i/), (15) 
where b„{\j/)) is the w-th component of b{\j/)). 

As a consequence, the assumption of primitivity guarantees the existence and unique- 
ness of a Gibbs distribution. Note that it is more general than the detailed balance 
assumption. 

2.3.3 Computing averages of monomials 

Since ^^[^1] = LivMv/[H0/(^) one obtains using (fTSl l: 

Mv/[^/]= L b„{y/)}(l>i{w){b„{Y)- (16) 

This provides a fast way to compute 

"Additional transitions are usually forbidden by dynamics. As a consequence, those transitions have a 
zero probability of occurence and they can be detected on empirical sequences (see section [3.4. U . 

'^This theorem has been generalized by Ruelle to infinite range potentials under some regularity conditions 
(63] IM). 
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2.3.4 The topological pressure. 

The RPF theorem gives a direct access to the topological pressure P{y/) which is the 
logarithm of the leading eigenvalue s^Xj/), easily obtained by a power method (see eq. 
(O). In the case of range-/? potentials Q where the topological pressure P(v^) be- 
comes a function of the parameters A = we write One can show that the 
topological pressure is the generating function for the cumulants of the monomials 0/: 

'-^-^vm- (17) 

Higher order cumulants are obtained likewise by successive derivations. Especially, 
second order moments related to the central limit theorem obeyed by Gibbs distribu- 
tions lis] [38l are obtained by second order derivatives. As a consequence of this last 
property, the topological pressure's Hessian is positive and the topological pressure is 
convex with respect to A . 

2.3.5 Entropy 

Since jj.^ is a Gibbs distribution, for the potential xif, therefore, an exact expression for 
the Kolmogorov-Sinai entropy Q can be readily obtained: 

h[^^]=pix)-Y,x,^^m. (18) 

2.3.6 Normalisation 

When switching from the potential (|5]l, which is the polynomial expansion of the log 
of the conditional intensity, to a simplified parametric form (|6]l, one introduces several 
biases. First, one may add terms which are not in the original potential. Second, Q 
must satisfy a constraint corresponding to the fact that i^(g)) is the log of a probability. 
Such a potential is called normalized. Its main characteristics are (i) the topological 
pressure is zero; (ii) the right eigenvector b{\lf)) has all its components equal. The 
reason is simple: when the potential is the log of a transition probability the RPF oper- 



ator satisfies T,,,e._^L,/,, = La,{Q)e{o.if ^ 



corn [CO] 



-1 



l,Vw'e^, where w' 



corresponds e.g. to [o]!^, and w to [co]^_i^j^_yy Thus, the largest eigenvalue s{y/) is 1, 
and the corresponding right eigenvector has all its components equal. 

On the opposite, the parametric form ^ where A/ are free parameters is in gen- 
eral not normalized, with deep consequences discussed in the next sections. However, 
there exists a transformation allowing to convert an arbitrary range-/? potential to a 
normalized potential ^ by the transformation: 

I'w'H- = -log(Z7,„,,(v/)» +log(/7,,(v/))) -/'(V^). (19) 

Let us give two examples. First, if \j/ is normalized then b„t{^)) — bn-{yf)) and 
P{y/) —0 so that (fortunately) ^ ~ y/. Second, if y/ has range- 1 then, according to 
the computations done in section lZTSl ^^.'^ — — log(Z) + Xj/^^./ = — log(Z) + \i/{(o{Q)). 
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Here, the normalisation only consists of removing the log of a (constant) partition 
function. 

In the general case, the potential ( fT9] l has range-/? + 1 . The corresponding RPF op- 
erator L{^), is therefore the transition matrix for a /?-step Markov chain. Thus, switch- 
ing to a parametric form (|6]l without constraint on the A/'s we end up with a redundant 
transition probability of form P' (a)(0) | [0]!^) while the right transition probability is 

P{(0{0) \ [(0]zIr_i))- Since, obviously P'(to(0) I f(«(0) | [(o]zIr_i)) the final 

form of the normalized potential can be easily simplified. 

2.3.7 Probability of arbitrary spike blocks 

Using the normalized potential ( fT9] l the probability of a admissible spike block of size 
strictly larger than R, [o)]'^"^'^"', f e « > is given by: 



IXx^ [C0{t),C0{t+l)...C0{t + n+R-l)]^tit^[wt,...,Wt+n] 



where the word encodes the block [ci)]|^^^ As a consequence, 

jj.^ [wt,. . . ,w,+„] = ;Uy[wr]Lvv„H.-,+i Cl')Lw,+i,w,+2('P) ■ ■ ■'L„.,+„_i,H.,+„CP)- (20) 

This is the classical Chapman-Kolmogorov equation. Returning to the initial (non- 
normaUsed) potential (|6]l this relation reads, using ( fT9] l: 

M,K.^..,.,J (21) 

One check^ that jj.-^ [wt,... ,Wt+„] satisfies the definition of a Gibbs distribution 
([B with P{y(f) = logs{\if) and Wk = g'^wq- 

On the opposite, for blocks of size < n < R+l then 

WC[0>]'+" 

where the sum holds on each word w containing the block [g)]J^". 



2.3.8 Links with the simple Gibbs form. 

In this section we make the link between our formalism and previous approaches using 
the simple Gibbs formulation. 

As a preliminary remark note that the Gibbs-probability of a spike block w of length 
R, given by (fTST i, hasn 't the form ig'^'t"') , with Z constant, except when R=\. The case 
R — I corresponds to a Markov chain without memory, where therefore the spiking 
pattern w, = a)(f) is independent on w,_i — co{t — 1). Examples are the Bernoulli 
model (where moreover spikes are spatially independent) or the Ising model (where 
spikes are spatially correlated but not time correlated). In this case, all transitions 

'^Taking into account the fact that symbols n> encode spike blocks of length R. 
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are allowed, thus the RPF matrix reads L^^J ^^,{y/) = e^'^' and does not depend on w. 
As a consequence, all lines are linearly dependent which implies that there are — 

1 0-eigenvalues while the largest eigenvalue is Z'^ s(\(f) — Tr(L{\(f)) — ^ e'''"'. 

The corresponding left eigenvector is {b{y/) — (1, . . . , 1) and the right eigenvector is 
bw'iw)) = ^° — 1- Thus, the Gibbs distribution is, according to 

Consider now larger ranges. Recall first that a potential of form (|6]l is in gen- 
eral not normalized. To associate it to a Markov chain one has to use the trans- 
formation ( fT9] l and the probability of a spiking pattern sequence is given by dSTT l. 
In particular, the joint probability of two admissible successive blocks W ,w reads 
/ii(,(w, w') = My ['^1 'h'j('^)) PT?!^'^'"' ■ introduce a formal Hamiltonian H„„i = 

y/^i + log(/7H'('/))) and a "conditional" partition function Z(w') = e^'^^^by,/{\if)) such 
\h2LtjXxf,{w\W) = 2^^^'™' withZ(w') =Lvi'G^^^'™' but here the partition function de- 
pends on W (compare with eq. (1) in ref ll44l ). This corresponds, in statistical mechan- 
ics, to have interactions with a boundary. In this setting, the free energy density (topo- 
logical pressure) is obtained (away from phase transitionj*^. via - logZ„(w') P{w) 
as « — > oo, Vw', requiring to consider a thermodynamic limit, as we do in the present 
setting. 

As a conclusion, starting from an a priori form of a potential obtained e.g. by 
Jaynes argument (see section lZ4.3l l one obtains a non normalized potential which can- 
not be directly associated with a Markov chain, and the corresponding Gibbs measure 
hasn't the simple Gibbs form used for Ising model, as soon as one introduces memory 
terms in the potential. However, the thermodynamic formalism allows one to treat this 
case without approximations, or assumptions such as detailed balance, and gives direct 
access to the topological pressure. 



2.3.9 Comparing several Gibbs statistical models. 

The choice of a potential (|6]l, i.e. the choice of a set of observables, fixes a statistical 
model for the statistics of spike trains. Clearly, there are many choices of potentials and 
one needs to propose a criterion to compare them. The KuUback-Leibler divergence. 



d{v,ii) 



:limsup- ^([®]o O^og 



(22) 



where v and /i are two invariant probability measures, provides some notion of asym- 
metric "distance" between /i and v. 

The computation of d{v,pL) is delicate but, in the present context, the following 
holds. For v an invariant measure and /x^/ a Gibbs measure with a potential i^, both 
defined on the same set of sequences E, one has |5 63. 38. 16J : 



d{v,fi^) =P(v/)-v(v/-)-/z(v). 



(23) 



'''This requires a sufficiently fast decay of the potential, as mentioned in the footnote|4] 
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This is the key of the algorithm that we have developed. 

2.4 Computing the Gibbs distribution from empirical data. 

2.4.1 Empirical Averaging 

Assume now that we observe the spike trains generated by the neural network. We 
want to extract from these observations informations about the set of monomials (pi 
constituting the potential and the corresponding coefficients A;. 

Typically, one observes, from ^ repetitions of the same experiment, i.e. submit- 
ting the system to the same conditions, ,yK raster plots O)'™' , w = 1 . . . ^ on a finite 
time horizon of length T. These are the basic data from which we want to extrapolate 
the Gibbs distribution. The key object for this is the empirical measure. For a fixed ^ 
(number of observations) and a fixed T (time length of the observed spike train), the 
empirical average of a function / : E — > M is: 

f^^^'^ = 4^Ltfi<y'^^'"^ (24) 

m=lr=l 

Typical examples are f{(o) = (Oi{Q) in which case the empirical average of / is the 
firing rat j*^ of neuron /; /(o) = 0, (0)0)^(0) then the empirical average of / measures 
the estimated probability of spike coincidence for neuron j and /; f{(o) — (Oi{z)(Oj{Q) 
then the empirical average of / measures the estimated probability of the event "neuron 
j fires and neuron / fires T time step later" (or sooner according to the sign of t). 

Note that in (l24l i we have used the shift a' for the time evolution of the raster plot. 
This notation is compact and well adapted to the next developments than the classical 
formula, reading, e.g., for firing rates -p^ T.''m=i TJ=i /{co^'"^ (?))■ 

The empirical measure is the probability distribution ;r'^' such that, for any func- 
tioi{3/:i:^R, 

^m^f^^ji.r.T)^ (25) 
Equivalently, the empirical probability of a spike block [0)]'^ is given by: 

'''^'([<)=i?EE^K;«''<°""''' 

m=lr=l ^ 

where Z|g,]'2 is the indicatrix function of the block [ojj^ so that L^Li Z[g,]'2 (c'ft)''"') 
simply counts the number of occurences of the block [coj^j in the empirical raster co^™' . 

2.4.2 Estimating the potential from empirical average 

The empirical measure is what we get from experiments while it is assumed that spike 
statistics is governed by an hidden Gibbs distribution jiif, that we want to determine or 
approximate. Clearly there are infinitely many a priori choices for this distribution. 

Recall that we assume dynamics is stationary so rates do not depend on time. 
'*In fact, it is sufficient here to consider monomials. 
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corresponding to infinitely many a priori choices for the potential xf/. However, the 
ergodic theorem (the law of large number) states that tt'^' — > /i^ as T — > oo where 
jJ.^, is the sought Gibbs distribution. Equivalently, the Kullback-Leibler divergence 

d ^K^^\ny,^ between the empirical measure and the sought Gibbs distribution tends 
to as r — ^ oo. 

Since we are dealing with finite samples the best that we can expect is to find a 
Gibbs distribution which minimizes this divergence. This is the core of our approach. 
Indeed, using eq. ( |23] ) we use the approximatiorPl: 

rf(;r(^),M^)=P(V/)-;r(^'(v/)-/!(;r(^'). (27) 

The advantage is that this quantity can be numerically estimated, since for a given 
choice of y/ the topological pressure is known from the Ruelle-Perron-Frobenius the- 
orem, while 7t'^^\\jf) is directly computable. Since ;r'^' is fixed by the experimental 
raster plot, /i(7r'^') is independent of the Gibbs potential, so we can equivalently mini- 
mize: 

h[w]=P[v]-^^^Hv), (28) 

without computing the entropy /z(7r'^'). 

This relation holds for any potential. In the case of a parametric potential of the 
form (|6]l we have to minimize 

h[l]^P[X]-Y^l,K^^\^i). (29) 

/=i 

Thus, from ( fTTI i and (|24] |. given the parametric form, the set of A/'s minimizing the KL 
divergence are given by: 

Mv.M-;r<^' (</>;), l=l...L. (30) 

Before showing why this necessary condition is also sufficient, we want to comment 
this result in connection with standard approaches ("Jaynes argument"). 

2.4.3 Inferring statistics from empirical averages of observables: The Jaynes 
argument. 

The conditions ( l30l l impose constraints on the sought Gibbs distribution. In view of the 
variational principle (|2]l the minimization of KL divergence /or a prescribed paramet- 
ric form of the Gibbs potential is equivalent to maximizing the statistical entropy under 
the constraints diOl ). where the A/'s appear as adjustable Lagrange multipliers. This is 
the Jaynes argument lf34ll commonly used to introduce Gibbs distributions in statisti- 
cal physics textbooks, and also used in the fundating paper of Schneidman et al. Il67l . 
There is however an important subblety that we want to outline. The Jaynes argument 
provides the Gibbs distribution which minimizes the KL divergence with respect to the 
empirical distribution in a specific class of Gibbs potentials. Given a parametric form 
for the potential it gives the set of A/'s which minimizes the KL divergence for the set 

'^This is an approximation because n^^"^ is not invariant (38). It becomes exact as 7 — > +t». 
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of Gibbs measures having this form of potential fT\\. Nevertheless, the divergence can 
still be quite large and the corresponding parametric form can provide a poor approxi- 
mation of the sought measure. So, in principle one has to minimize the KL divergence 
with respect to several parametric forms. This is a way to compare the statistical mod- 
els. The best one is the one which minimizes ( |29] l, i.e. knowing if the " model"i/2 is 
significantly "better" than y/j, reduces to verifying: 

/i[V/2]«/^[ri], (31) 

easily computable at the implementation level, as developed below. Note that h has the 
dimension of entropy. Since we compare entropies, which units are bits of information, 
defined in base 2, the previous comparison units is well-defined. 

2.4.4 Convexity. 

The topological pressure is convex with respect to A. As being the positive sum of 
two (non strictly) convex criteria P [i/a] and —7t^^^ {\jf) in ( |29] ), the minimized criterion 
is convex. This means that the previous minimization method intrinsically converges 
towards a global minimum. 

Let us now consider the estimation of an hidden potential \f/ = ^^^j A/0/ by a test 

potential i/a^'*^*') = Y4L1 ' (pf""'^ ■ As a consequence, we estimate i// with a set of 

parameters Xl-''"''\ and the criterion ( |29] l is minimized with respect to those parameters 

Several situations are possible. First, \j/ and xj/'^"""^ have the same set of mono- 
mials, only the A/'s must be determined. Then, the unique minimum is reached for 
^(tuit) — i — 1 . . Second, contains all the monomials of \j/ plus additional 

ones (overestimation). Then, the A^^'^^'^'s corresponding to monomials in y/ converge 
to A/ while the coefficients corresponding to additional monomials converge to 0. The 
third case corresponds to underestimation. contains less monomials than \j/ or 

distinct monomials. In this case, there is still a minimum for the criterion ( |29] |. but 
it provides a statistical model (a Gibbs distribution) at positive KL distance from the 
correct potential ||2TI . In this case adding monomials to i/a^'*^*') will improve the estima- 
tion. More precisely, if for a first test potential the coefficients obtained after minimi- 
sation of h are Xj-"""'^ ,1 — I ... and for a second test potential they are A^ , / = 
1... l' l'C-') >L('-') then/i(Af"''\...,A^;f;;,|) >KaI*'"''\...,A^;^^^^^^ For the 

same / the coefficients Xj"''"^ and A, ^"""'^ can be quite different. 

Note that these different situations are not inherent to our procedure, but to the 
principle of finding an hidden probability by maximizin g th e statistical entropy under 
constraints, when the full set of constraints is not known Examples of these cases 
are provided in section |4] As a matter of fact, we have therefore two strategies to 

"*The problem of estimating the memory order of tlie underlying markov chain to a given sequence, which 
means, in our framework, to find the the potential range, has been a well known difficult question in coding 
and information theory |49|. Some of the current available tests might offer additional algorithmic tools that 
would be explored in a forthcoming paper 



18 



estimate an hidden potential. Either starting from a minimal form of test potential (e.g. 
Bernoulli) and adding successive monomials (e.g. based on heuristical arguments such 
as "pairwise correlations do matter") to reduce the value of h. The advantage is to 
start from potentials with a few number coefficients, but where the knowledge of the 
coefficients at a given step cannot be used at the next step, and where one has no idea on 
"how far" we are from the right measure. The other strategy consists of starting from 
the largest possible potential with range • In this case it is guarantee that the test 
potential is at the minimal distance from the sought one, in the set of range-/? potential, 
while the minimization will remove irrelevant monomials (their coefficient vanishes 
in the estimation). The drawback is that one has to start from a very huge number of 
monomials (2^^) which reduces the number of situations one can numerically handle. 
These two approaches are used in section|4] 

2.4.5 Finite sample effects and large deviations. 

Note that the estimations crucially depend on T. This is a central problem, not inher- 
ent to our approach but to all statistical methods where one tries to extract statistical 
properties from finite empirical sample. Since T can be small in practical experiments, 
this problem can be circumvented by using an average over several samples (see eq. 
(l24l i and related comments). Nevertheless it is important to have an estimation of finite 
sampling effects, which can be addressed by the large deviations properties of Gibbs 
distributions. 

For each observable (pi, I = I ...L, the following holds, as T — > +00 ||22l : 

^^{(0, \n^^\^i) - Mr (</>/) I > e} ^ exp(-r/,(£)), (32) 

where //(x) = sup;^^^^^ [Xix — P[X]), is the Legendre transform of the pressure P [X]. 

This result provides the convergence rate with respect to T. This is very important, 
since, once the Gibbs distribution is known, one can infer the length T of the time win- 
dows over which averages must be performed in order to obtain reliable statistics. This 
is of particular importance when applying statistical tests such as Neymann-Pearson 
for which large deviations results are available in the case of Markov chains and Gibbs 
distributions with finite range potentials ll50l . 

Another important large deviation property also results from thermodynamic for- 
malism lt38l[T4ll22l . Assume that the experimental raster plot (O is distributed according 
to the Gibbs distribution {J.^/, with potential and assume that we propose, as a statis- 
tical model, a Gibbs distribution with potential \j/ ^ The Gibbs measure of spike 

blocks of range ( fTsT i is a vector in and ;r(^' = ( ;r^^^(w)] ) is a random vec- 

tor. Now, the probability jU|^|||;r'^' — < th^t ;r'^' is e-close to the "wrong" 
probability ii^i decays exponentially fast as, 

Hy,{\\7z''^'^-liy\\<e}^exp{-T inf d{^i,n^))). (33) 

''ibid. 
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Thus, this probability decreases exponentially fast with T, with a rate given (for 
small e) by T d{lJ.^/,lJ.y,i). Therefore, a difference of rj in the KuUback-Leibler diver- 



Consequently, for T ^ 10^ a divergence of order 77 = 10 ^ leads to a ratio of order 
exp(— 10). Illustrations of this are given in section]!] 

2.4.6 Other statistics related to Gibbs distributions. 

The K-L divergence minimization can be completed with other standard criteria for 
which some analytical results are available in the realm of Gibbs distributions and ther- 
modynamic formalism. Fluctuations of monomial averages about their mean are Gaus- 
sian, since Gibbs distribution obey a central limit theorem with a variance controlled 
by the second derivative of P{X). Then, using a test seems natural. Examples are 
given in section |4l In order to compare the goodness-of-fit (GOF) for probability dis- 
tributions of spike blocks, we propose at the descriptive level the box plots tests. On 
the other hand, quantitative methods to establish GOF are numerous and can be clas- 
sified in families of 'test Statistics', the most important being the Power-Divergence 
methods (eg. Pearson-;);^ test), the Generalized Kolmogorov-Smirnov (KS) tests (eg. 
the KS and the Watson-Darling test) and the Phi-Divergence methods (eg. Cramer-von 
Mises test) ll20l[T7l . Finally, to discriminate 2 Gibbs measures one can use the Neyman- 
Pearson criteria since large deviations results for the Neyman-Pearson risk are available 
in this case |50|. In the present paper we have limited our analysis to the most standard 
tests (diagonal representations, box plots, X^)- 

3 Application: parametric statistic estimation. 

Let us now discuss how the previous piece of theory allows us to estimate, at a very 
general level, parametric statistics of spike trains. 

We observe neurons during a stationary period of observation T, assuming that 
statistics is characterized by an unknown Gibbs potential of range R. The algorithmicF^ 
procedure proposed here decomposes in three steps: 

1. Choosing a statistical model, i.e. fixing the potential ^ (equivalently, the rele- 
vant monomials or "observables"). 

2. Computing the empirical average of observables, i.e. determine them from the 
raster, using eq. (l24l i. 

3. Performing the parametric estimation, i.e. use a variational approach to deter- 
mine the Gibbs potential. 

Let us describe and discuss these three steps, and then discuss the design choices. 

-"The code is available athttp://enas.gforge.inria. f r /classGibbsPotential . html 




;iy{||;r(^)-fi,,||<£} 



of order exp — Ttj. 
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3.1 Choosing a model: rate, coincidence, spiking pattern and more. 

3.1.1 The meaning of monomials. 

In order to understand the power of representation of the proposed formaHsm, let us 
start reviewing a few elements discussed at a more theoretical level in the previous 
section. 

We start with a potential limited to a unique monomial. 

• If 1/ = 0,(0), its related average value measures the firing probability ot firing 
rate of neuron /; 

• If V^(ft)) = 0, (0) ft); (0), we now measure the probability of spikes coincidence for 
neuron j and /, as pointed out at the biological level by, e.g, [28] and developed 
by 1661; 

• If i^(o) = ft),(T) Oj(0), we measure the probability of the event "neuron j fires 
and neuron / fires t time step later" (or sooner according to the sign of t); in 
this case the average value provide^ the cross-correlation for a delay T and the 
auto-correlation for / = j; 

• A step further, if, say, ^{(o) — 0,(0)0^(0)0)^(1), we now take into account 
triplets of spikes in a specific pattern (i.e. one spike from neuron / coinciding 
with two successive spikes from neuron j); 

These examples illustrate the notion of "design choice": the first step of the method 
being to choose the "question to ask", i.e. what is to be observed over the data. In this 
framework, this translates in: "choosing the form of the potential". Let us enumerate a 
few important examples. 

3.1.2 Taking only rate or synchronization into account: Bernoulli and Ising po- 
tentials. 

Rate potential are range- 1 potentials, as defined before. Such models are not very 
interesting as such, but have two applications: they are used to calibrate and study 
some numerical properties of the present methods, and they are also used to compare 
the obtained conditional entropy with more sophisticated models. 

Ising potentials have been introduced by Schneidman and collaborators in | [67l . 
taking rate and synchronization of neurons pairs, as studied in, e.g. |28|. This form is 
justified by the authors using the Jaynes argument. 

Let us now consider potentials not yet studied, up to our best knowledge, in the 
present literature. 

3.1.3 Taking rate and correlations into account: RPTD-fc potentials. 

This is a key example for the present study. On one hand, the present algorithmic 
was developed to take not only Bernoulli or Ising-like potential into account, but a 

Substracting the firing rates of i and j. 
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large class of statistical model, including a general second order model (redundant 
monomial being eliminated), i.e. taking rate, auto-correlation (parametrized by A,t) 
and cross-correlation (parametrized by ) into account. 

Being able to consider such type of model is an important challenge, because it 
provides a tool to analyze not only synchronization between neurons, but more general 
temporal relations (see e.g. Il23ll28l l6l for important applications). 

Let us now turn to a specific example related to the neuronal network dynamic 
analysis. 

3.1.4 Taking plasticity into account: "STDP" potentials 

In lfT2l we considered Integrate-and-Fire neural networks with Spike-Time Dependent 
Plasticity of type: 



Y T+Ts Is 

raWij+j E ^J^f) E f{u)0},{t + u) 

t=Ts u= — Ts 



(34) 



where Wij is the synaptic weight from neuron /' to neuron /, — 1 < r^/ < a term corre- 
sponding to passive LTD, T a large time, corresponding to averaging spike activity for 
the synaptic weights update, and, 

A_e~, x<0, A_<0; 

= A+e"^, x>0, A+>0; 
0, x = Q; 

with A- < and A+ > 0, is the STDP function as derived by Bi and Poo E). The 
shape of / has been obtained from statistical extrapolations of experimental data. 

def 

Ts = 2max(T+,T_) is a characteristic time scale. We argued that this synaptic weight 
adaptation rule produces, when it has converged, spike trains distributed according to 
a Gibbs distribution with potential: 

V/(a))^E^/')a),(0)+E'E'^f E /(«)«,(0)a);(«). (35) 

i=0 i=0 ;=() u=-Ts 

When considering a large number of neurons, it becomes difficult to compute and 
check numerically this joint probability over the whole population. Here, we propose 
to consider a subset of Ns < N neurons. In this case, the effects of the rest of 
the population can be written as a bulk term modulating the individual firing rates and 
correlations of the observed population, leading to a marginal potential of the form: 

w^Sco)^ E^/"«'(o)+ E L^iP E mcomcojiu). 06) 

Here, the potential is a function of both past and future. A simple way to embed this 
potential in our framework, is to shift the time by an amount of Ts, using the stationarity 
assumption. 
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3.1.5 The general case: Typical number of observed neurons and statistics range. 

The previous piece of theory allows us to take any statistics of range R, among any 
set of neurons into account. At the numerical level, the situation is not that simple, 
since it appears, as detailed in the two next sections, that both the memory storage 
and computation load are in (9(2^*), except if the grammar is very restrictive, and the 
possible spike pattern blocks very sparse. Hopefully, we are going to see that estimation 
algorithms are rather efficient, thus do not lead to a complexity larger than 0(2^'^). 

It is clear that the present limitation is intrinsic to the problem, since we have 
at least, for a statistics of range R, to count the number of occurrences of blocks of 
neurons of size R, and there are (at most) 2^^^ of them. Fastest implementations 
must be based on the partial observation of only a subset of, e.g., the most preeminent 
occurrences. 

Quantitatively, we consider "small" values of and R, typically a number of neu- 
rons equal to N G {1,— 8}, and Markov chain of range R = {1,— 16}, in order to 
manipulate quantities of dimension < 8, and R < 16, and such that N{R + 1) < 18. 



3.2 Computing the empirical measure: prefix-tree construction. 

For one sample = l),the empirical probability ( |25] | of the block [ft)]'_£, , D <t <Q 
is given by 

thus obtained counting the number of occurrence # [co]'_o , -D < f < of the block 
in the sequence [oj^^j-. Since we assume that dynamics is stationary we have. 

We observe that the data structure size has to be of order 0{2^^) (lower if the dis- 
tribution is sparse), but does not depends on T. Since many distributions are sparse (not 
all blocks occur, because the distribution is constrained by a grammar), it is important 
to use a sparse data structure, without storing explicitly blocks of occurence zero. 

Furthermore, we have to study the distribution at several ranges R and it is important 
to be able to factorize these operations. This means counting in one pass, and in a 
unique data structure, block occurrences of different ranges. 

The chosen data structure is a tree of depth R + 1 and degree 2^. The nodes at 
depth D count the number of occurrences of each block [a)]'_Q_|_,, of length up to D < 
R + 10 It is known (see, e.g., ||29J for a formal introduction) that this is a suitable data 
structure (faster to construct and to scan than hash-tables, for instance) in this context. 
It allows to maintain a computation time of order 0{TR), which does not depends on 
the structure size. 

3.2.1 The prefix-tree algorithm. 

Since we use such structure in a rather non-standard way compared to other authors, 
e.g. 1291 l26l . we detail the method here. 



The code is available athttp://enas.gforge.inria.fr/classSuffixTree. html. 
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We consider a spike train or_j, where time is negative. The prefix-tree data struc- 
ture for the present estimation procedure is constructed iteratively. 

1. Each spiking pattern at time f, o(f ), is encoded by an integer M'(f). 

2. This given, before any symbol has been received, we start with the empty tree 
consisting only of the root. 

3. Then suppose for —D < f < that the tree 3/'{[(O^Zt) represents [coJ'jjJ. One 
obtains the tree £/'{[co\_j) as follows: 

(a) One starts from the root and takes branches corresponding to the observed 
symbols (o{t -D+ 1), • • •, (o(t). 

(b) If ones reaches a leaf before termination, one replaces this leaf by an inter- 
nal node and extends on the tree. 

(c) Each node or leaf has a counter incremented at each access, thus counting 
the number of occurrence # , —D < f < of the block [(»]_£, in the 
sequence [0)]° j-- 

The present data structure not only allows us to perform the empirical measure 
estimation over a period of time T , but can also obviously be used to aggregate several 
experimental periods of observation. It is sufficient to add all observations to the same 
data structure. 

3.2.2 Generalization to a sliding window. 

Though we restrict ourselves to stationary statistics in the present work, it is clear that 
the present mechanism can be easily generalized to the analysis of non-stationary data 
set, using a sliding window considering the empirical measure in [f , f + T [, then [f + 
1 , f + 1 + r [, etc. . This is implemented in the present data structure by simply counting 
the block occurrences observed at time t and adding the block occurrences observed 
at time T, yielding a minimal computation load. The available implementation has 
already this functionality (see section |4.3.5| for an example). 

3.3 Performing the parametric estimation 

In a nutshell, the parametric estimation reduces to minimizing ( |27] |. by calculating the 
topological pressure P{'^) = P{^) using (fl4b and the related theorem. The process 
decomposes into the following steps. 

3.3.1 Potential eigen-elements calculation. 

It has been shown in the theoretical section that the Ruelle-Perron-Frobenius operator 
eigen-elements allows one to derive all characteristics of the probability distribution. 
Let us now describe at the algorithmic level how to perform these derivations. 
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1. The first step is to calculate the right-eigenvector b(v/)) of the L{y/) operator, 
associated to the highest eigenvalue, using a standard power-method series: 

= ||L(v/)v("-i)|| 
vW = -^L(v/)v("-i) 

where v'"' is the n-th iterate of an initial vector v*^"' and s'"' is the n-th iterate 
of an initial real value s^^\ With this method the pair (s'^"',v'"') converges to 
(i(v),b(v))) as soon as v'"' is not orthogonal to h{\j/)). In our case, after some 
numerical tests, it appeared a good choice to either set v'"' to an uniform value, 
or to use the previous estimated value of b(i/)), if available. This last choice is 
going to speed up the subsequent steps of the estimation algorithm. 

The key point, in this iterative calculation, is that L{y/) is (hopefully) a sparse 
2^^^ X 2^^ matrix, as outlined in the section 12". 3. II As a consequence calculating 
L(v/) V is a 0(2'^^^'^) <^ 0{2^^^) operation, making explicit the grammar in the 
implementation. 

The required precision on {s{^f) ,\i{'^f))) must be very high, for the subsequent 
steps to be valid, even if the eigenvector dimension is huge (it is equal to 2^^^), 
therefore the iteration must be run down to the smallest reasonable precision 
level (10^^^ in the present implementation). 

We have experimented that between 10 to 200 iterations are required for an initial 
uniform step in order to attain the required precision (for NR G 2.. 20), while less 
than 10 iterations are sufficient when starting with a previously estimated value. 

From this 1st step we immediately calculate: 

(a) The topological pressure = log{s{y/)). 

(b) The normalized potential ^'n. (this normalized potential is also stored in a 
look-up table). This gives us the transition matrix, which can be used to 
generate spike trains distributed according the Gibbs distribution /i^ and 
used as benchmarks in the section |4l 

2. The second step is to calculate the left eigenvector (b(i/), this calculation having 
exactly the same characteristics as for b{y/)). 

From this 2nd step one immediately calculates: 

(a) The Gibbs probability of a block w given by ( fTsT l. from which probabilities 
of any block can be computed (section |2.3.7| ). 

(b) The theoretical value of the observables average fX\f/{^i), as given in ( fT6] l. 

(c) The theoretical value of the distribution entropy h [/Xy/] , as given in ( fTsT l. 

After both steps, we obtain all useful quantities regarding the related Gibbs dis- 
tribution: probability measure, observable value prediction, entropy. These algo- 
rithmic loops are direct applications of the previous piece of theory and show the 
profound interest of the proposed framework: given a Gibbs potential, all other 
elements can be derived directly. 
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3.3.2 Estimating the potential parameters. 

The final step of the estimation procedure is to find the parameters A such that the 
Gibbs measure fits at best with the empirical measure. We have discussed why mini- 
mizing ( |27] | is the best choice in this context. Since /i(7r'^^') is a constant with respect 
to A, it is equivalent to minimize h [y/x] eq. ( l29l ). where jUi;? ((/>/) is given by (fTSl Equiv- 
alently, we are looking for a Gibbs distribution |J.^, such that '^'^^^^^ = ;r(^'(0;) which 
expresses that Tt''^^ is tangent to P at 1381 . 

3.3.3 Matching theoretical and empirical observable values. 

As pointed out in the theoretical part, the goal of the estimation is indeed to find the 
parameters A for which theoretical and empirical observable values match. The impor- 
tant point is that this is exactly what is performed by the proposed method: minimizing 
the criterion until a minimum is reached, i.e. until the gradient vanishes corresponding 
to a point where = ;r'^'(0/), thus where theoretical and empirical observable 

values are equal. Furthermore, this variational approach provides an effective method 
to numerically obtain the expected result. 

At the implementation level, the quantities 7t^'^\<j)i) are the empirical averages of 
the observables, i.e. the observable averages computed on the prefix tree. They are 
computed once from the prefix tree. For a given A , P(A ) is given by step 1 .a of the pre- 
vious calculation, while is given by the step 2.b. It is thus now straightforwarcQ 
to delegate the minimization of this criterion to any standard powerful non-linear min- 
imization routine. 

We have implemented such a mechanism using the GSL0 implementation of non- 
linear minimization methods. We have also made available the GSL implementation 
of the simplex algorithm of Nelder and Mead which does not require the explicit com- 
putation of a gradient like in eq. (|29] |. This alternative is usually less efficient than the 
previous methods, except in situations, discussed in the next section, where we are at 
the limit of the numerical stability. In such a case the simplex method is still working, 
whereas other methods fail. 

^^Considering a simple gradient sclieme, there is always a e* > 0, small enough for the series A* and ft*, 
defined by: 

A*+'=Af + e*^(Af) 

</!*+' </?, 

to converge, as a bounded decreasing series, since: 

I ^p + 0((e*)2). 

^''Xhe GSL http://www.gnu.org/software/gsl multi-dimensional minimization algorithms 
taking the criteria derivatives into account used here is the Fletcher-Reeves conjugate gradient algorithm, 
while other methods such as the Polak-Ribiere conjugate gradient algorithm, and the Broyden-Fletcher- 
Goldfarb-Shannon quasi-Newton method appeared to be less efficient (in precision and computation times) 
on the benchmarks proposed in the result section. Anyway, the available code http ; / /enas . gf orge . 
inria . f r/ classIterativeSolver . html allows us to consider these three alternatives, thus allow- 
ing to tune the algorithm to different data sets. 
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3.3.4 Measuring the precision of the estimation. 

Once the quantity h[\if]—P [y/] — tt'^' ( i/) (eq. ( |29l )) has been minimized the Kullback- 
Leibler divergence d{n'^^\pi\ff) = — h{7t^^^) determines a notion of "distance" 
between the empirical measure ;r'^' and the statistical model fly,. Though it is not 
necessary to compute d{n^^\iJ.y,) for the comparison of two statistical models 
the knowledge of d{7r^'^^ , /i^), even approximate, is a precious indication of the method 
precision. This however requires the computation of /i(7i:'^'). 

Though the numerical estimation of /!(;r'^^') is a far from obvious subject, we have 
implemented the entropy estimation using definitions (O and Q. In order to interpo- 
late the limit (|4|i, we have adapted an interpolation method from |29| and used the fol- 
lowing interpolation formula. Denote by /i(;r'^')'"' the entropy estimated from a raster 
plot of length T, considering cylinders of size n. We use the interpolation formula 
/!(7r(^')'"' ~ + ^, where h°° ,k,c > are free parameters, with /!(7r'^')'-"' — > h°°, as 
n — > +00. The interpolation formula has been estimated in the least square sense, cal- 
culating on the prefix-tree. The formula is linear with respect to h°° and k, 
thus has a closed-form solution with respect to these two variables. Since the formula 
is non-linear with respect to c, an iterative estimation mechanism is implemented. 

3.4 Design choices: genesis of the algorithm. 

Let us now discuss in details the design choices behind the proposed algorithm. 

The fact that we have an implementation able to efficiently deal with higher-order 
dynamics is the result of computational choices and validations, important to report 
here, in order for subsequent contributor to have the benefit of this part of the work. 

3.4.1 Main properties of the algorithm. 

Convexity. As indicated in the section l2".4.4l there is a unique minimum of the criterion. 
However, if contains monomials which are not in i/a, the procedure converges 

but there is an indeterminacy in the A/'s corresponding to exogenous monomials. The 
solution is not unique, there is a subspace of equivalent solutions. The rank of the 
topological pressure Hessian is an indicator of such a degenerate case. Note that these 
different situations are not inherent to our procedure, but to the principle of finding an 
hidden probabiUty by maximizing the statistical entropy under constraints, when the 
full set of constraints is not known flV\ . 

Finite sample effects. As indicated in the section 12.4.51 the estimations crucially de- 
pend on T . This is a central problem, not inherent to our approach but to all statistical 
methods where one tries to extract statistical properties from finite empirical sample. 
Since T can be small in practical experiments, this problem can be circumvented by 
using an average over several samples. In the present thermodynamic formalism it is 
possible to have an estimation of the size of fluctuations as a function of the potential, 
using the central limit theorem and the fact that the variance of fluctuations is given 
by the second derivative of the topological pressure. This is a further statistical test 
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where the empirical variance can be easily measured and compared to the theoretical 
predictions. 

Numerical stability of the method. Two factors limitate the stabihty of the method, 
from a numerical point of view. 

The first factor is that the RPF operator is a function of the exponential of the 
potential yf = Y.i As a consequence, positive or negative values of y/ yield huge 
or vanishing value of L^, and numerical instabilities easily occurs. 

However, though numerical instabilities are unavoidable, the good news is that they 
are easily detected, because we have introduced a rather large set of numerical tests in 
the code: 

1 . Negligible values (typically lower than 10^^) are set to zero, implicitly assuming 
that they correspond to hidden transition in the grammar. 

2. Huge value (typically higher than lO'*) generate a warning in the code. 

3. Several coherent tests regarding the calculation of the RPF eigen-elements are 
implemented: we test that the highest eigenvalue is positive (as expected from 
the RPF theorem), and that the left and right RPF related eigenvectors yield 
equal eigenvalues, as expected; we also detect that the power-method iterations 
converge in less than a maximal number of iteration (typically 2'"). We never 
found this spurious condition during our numerical tests. When computing the 
normalized potential ( fT9l ), we verify that the right eigenvalue is 1 up to some 
precision, and check that the normal potential is numerically normalized (i.e. 
that the sum of probabilities is indeed 1, up to some "epsilon"). 

In other words, we have been able to use all what the piece of theory developed in the 
previous section makes available, to verify that the numerical estimation is valid. 

The second factor of numerical imprecision is the fact that some terms A/ 0/ may be 
negligible with respect to others, so that the numerical estimation of the smaller terms 
becomes unstable with respect to the imprecision of the higher ones. This has been 
extensively experimented, as reported in the next section. 

Relation with entropy estimation. The construction of a prefix-tree is also the basis 
of efficient entropy estimation methods Il29ll68l . See ll26l for a comparative about en- 
tropy estimation of one neuron spike train (binary time series). Authors numerically 
observed that the context-tree weighting methods [42] is seen to provide the most accu- 
rate results. This, because it partially avoids the fact that using small word-lengths fails 
to detect longer-range structure in the data, while with longer word-lengths the empir- 
ical distribution is severely under-sampled, leading to large biases. This statement is 
weaken by the fact that the method from |68| is not directly tested in [26 J . although a 
similar prefix-tree method has been investigated. 

However the previous results are restrained to relative entropy estimation of "one 
neuron" whereas the analysis of entropy of a group of neurons is targeted if we want 
to better investigate the neural code. In this case ll68ll is directly generalizable to non- 
binary (thus multi-neurons) spike trains, whereas the context-tree methods seems in- 
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trinsically limited to binary spike-trains ll42l . and the numerical efficiency of these 
methods is still to be studied at this level. 

Here, we can propose an estimation for the KS-entropy from eq. ( fTSl l. Clearly, we 
compute here the entropy of a Gibbs statistical model /Xy/ while methods above try to 
compute this entropy from the raster plot. Thus, we do not solve this delicate problem, 
but instead, propose a method to benchmark these methods from raster plots obeying a 
Gibbs statistics, where the Gibbs distribution approaches at best the empirical measures 
obtained from experiments. 

3.4.2 Key aspects of the numerical implementation. 

Estimating the grammar from the empirical measure. 

The grammar defined in ( fTSl ) is implemented as a Boolean vector indexed by w and 
estimated by observing, in a prefix-tree of depth at least /? + 1, whose blocks [ft)]''^^_j 
occur at least once (allowed transition). We make therefore here the (unavoidable) 
approximation that unobserved blocks correspond to forbidden words (actually, our 
implementation allows to consider that a block is forbidden if it does not appear more 
than a certain threshold value). There is however, unless a priori information about the 
distribution is available, no better choice. The present implementation allows us to take 
into account such a priori information, for instance related to global time constraints 
on the network dynamics, such as the refractory period. See 111 21 for an extended 
discussion. 

Potential values tabulation. 

Since the implementation is anyway costly in terms of memory size, we have 
choosen to pay this cost but obtaining the maximal benefit of it and we used as much as 
possible tabulation mechanisms (look-up tables) in order to minimize the calculation 
load. All tabulations are based on the following binary matrix: 

Qe{0,lf><2"^ 

with Qi n- = 0/([a)]°^), where w is given by ( fTOl i. Q is the matrix of all monomial 
values, entirely defined by the choice of the parameter dimensions A^, R and D. It 
corresponds to a "look-up table" of each monomial values where w encodes [(b]'',^. 
Thus the potential Q writes i//„ = (QA)^. We thus store the potential exponential 
values as a vector and get values using a look-up table mechanism, speeding-up all 
subsequent computations. 

This allows to minimize the number of operations in the potential eigen-elements 
calculation. 

3.4.3 Appendix: About other estimation alternatives. 

Though what is proposed here corresponds, up to our best knowledge, to the best we 
can do to estimate a Gibbs parametric distribution in the present context, this is obvi- 
ously not the only way to do it, and we have rejected a few other alternatives, which 
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appeared less suitable. For the completeness of the presentation, it is important to 
briefly discuss these issues. 

Avoiding RPF right eigen-element's calculation. In the previous estimation, at each 
step, we have to calculate step 1 of the RPF eigen-element's derivation for the criterion 
value calculation and step 2 of the RPF eigen-element's derivation for the criterion 
gradient calculation. These are a costly 0(2^+^^) operations. 

One idea is to avoid step 2 and compute the criterion gradient numerically. We 
have explored this track: we have calculated ^ ~ h(Xi+e)-h{Xi-e) several order of 
magnitude, but always found a poorer convergence (more iteration and a biased result) 
compared to using the closed-form formula. In fact, each iteration is not faster, since 
we have to calculate h at two points thus, to apply step 1, at least two times. This 
variant is thus to be rejected. 

Another idea is to use a minimization method which does not require the calculation 
of the gradient: we have experimented this alternative using the simplex minimization 
method, instead of the conjugate gradient method, and have observed that both meth- 
ods correctly converge towards a precise solution in most cases, while the conjugate 
gradient method is faster. However, there are some cases with large range potential, 
or at the Umit of the numerical stabihty where the simplex method may stiU converge, 
while the other does not. 



Using a simple Gibbs form. Using the Gibbs form 

H^[Wt,..., Wt+n] = = , With Zn= £ ^ *=' * ' 

" w,,...,w,+„ 

where Z„ is a constant, could provide an approximation of the right Gibbs distribution 
and of the topological pressure, avoiding the power-method internal loop. Furthermore, 
instead of a costly 0{2^^^^) operation, calculating Z„ (and derivatives) would require 
a simple scan of the prefix-tree (since values are calculated at each step weighted by 
the empirical measure values) thus 0(2^^) operations. This apparent gain is unfortu- 
nately impaired since the amount of calculation is in fact rather heavy. Moreover, as 
widely commented on section 2, the result is biased with a non negligible additional 
bias increasing with the range R of the potential. Finally, it has been observed as being 
slower than for the basic method. 



About analytical estimation of the RPF eigen-element's. The costly part of the RPF 
eigen-element's computation is the estimation of the highest eigenvalue. It is well- 
known that if the size of the potential is lower than five, there are closed-form solutions, 
because this problem corresponds to finding the root of the operator characteristic poly- 
nomial. In fact, we are going to use this nice fact to cross-validate our method in the 
next section. However, except for toy's potentials (with 2'*'^ < 5 ^ NR < 2 !), there is 
no chance that we can not do better than numerically calculating the highest eigenvalue. 
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And the power method is known as the most powerful way to do it, in the general case. 
We thus have likely optimal choices at this stage. 



Using other approximations of the KL-divergence criterion. Let us now discuss 
another class of variants: the proposed KL-divergence criterion in (l22T i and its empirical 
instantiation in dZTl ) are not the only one numerical criterion that can be proposed in 
order to estimate the Gibbs distribution parameters. For instance, we have numerically 
explored approximation of the KL-divergence of the form: 



V ([<-') log 

















[[<-') \ 



and have obtained coherent results (for a„ = 1), but not quantitatively better than what 
is observed by the basic estimation method, at least for the set of performed numerical 
tests. 

All these variants correspond to taking into account the same kind of criterion, but 
some other weighted evaluations of the empirical average of the observable. There is 
no reason to use it unless some specific a priori information on the empirical distribu- 
tion is available. 



Another interesting track is to use ( fT9] l which allows us to write a KL-divergence 
criterion, not on the probability block, but on the conditional probability block, as pro- 
posed in lfT4l [TSl in a different context. We have considered this option. However a 
straightforward derivation allows one to verify, that this in fact corresponds the same 
class of criterion but with a different empirical observable average estimation. At the 
numerical level, we did not observe any noticeable improvement. 



Using score matching based estimation. We are here in a situation where we have to 
estimate a parametric statistical distribution, whose closed-form is given up to a scale 
factor Z„. Such model contains a normalization constant whose computation may be 
considered as too difficult for practical purposes, as it is the case for some maximum 
likelihood estimations. Score-matching methods |32 | are based on the gradient of the 
log-density with respect to the data vector, in which the normalization constant is elim- 
inated. However, the estimation criterion is no more the KL-divergence, and there is no 
guaranty that the obtained solution is not biased with respect to a well-defined statisti- 
cal quantity. As such it is another candidate to estimate Gibbs distribution. However, 
thanks to the eigen decomposition of the RPF operator, we do not need to use this trick, 
since we obtain a tractable calculation of the normalization constant at each step of the 
estimation and can minimize a well-defined criterion, as proposed in this paper 

We have numerically checked such modification of the criterion in which we do 
not consider the KL-divergence criterion, but the ratio between two conditional prob- 
abilities, as defined in ( fT9l ). Considering this ratio allows to eliminate the scale factor 
Z„. This is the same spirit as score matching based estimation, more precisely, it corre- 
sponds to a discrete form of it, where the gradient of the log-density is replaced by finite 
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difference. We have obtain correct results for simple forms of potential, but have ex- 
perimented that the method is numerically less robust than using the unbiased method 
developed in this paper. This confirms that using the eigen-decomposition of the RPF 
operator, is the key for numerically stable estimations of such parametric statistics. 

Estimation in the case of a normalized potential. In the case where the potential 
is normalized, the criterion ( |29] l is a simple linear criterion, thus unbounded and its 
minimization is meaningless. In this singular case, its is obvious to propose another 
criterion for the estimation of the parameters. A simple choice is to simply propose 
that the theoretical likelihood of the measure matches the estimated one, in the least 
square sense. This has been integrated in the available code. 

4 Results 

4.1 Basic tests: validating the method 
4.1.1 Method 

Knowing the potential y/, it is easy to generate a spike train of length T, distributed 
according to jJ-xf,, using the Chapman-Kolmogorov equations (l20b . Thus, we have con- 
sidered several examples of Gibbs potentials, where, starting from a sample raster plot 
[(o]^_j distributed according to ji^,, we use our algorithm to recover the right form of 
V- 

Given a potential of range-/? of the parametric form (|6]l and a number of neurons 
we apply the following method: 

1. Randomly choosing the parameter's values A/, Z = I ...L of the Gibbs potential; 

2. Generating a spike train realization of length T; 

3. From these values re-estimating a Gibbs potential: 

(a) Counting the block occurrences, thus the probabilities tt'^^ from the prefix- 
tree, 

(b) Minimizing ( |29l ), given 7t^^\ as implemented by the proposed algorithm. 

(c) Evaluating the precision of the estimation as discussed in the previous sec- 
tion. 

In the previous method, there is a way to simulate "infinite" (T — sequences, 
by skipping step 2., and filling the prefix-tree in step 3. a directly by the exact probabil- 
ity IJ.\ii{w) of the blocks w. 

At a first glance, this loop seems to be a "tautology" since we re-estimate the Gibbs 
potential parameters from a one-to-one numerical process. However, this is not the 
case for three reasons: 
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1 . For T = +00 using the same potential for the prefix-tree generation and for the 
parameters estimation, must yield the same result, but up to the computer numer- 
ical precision. This has to be controlled due to the non-linear minimization loop 
in huge dimension. This is obviously also a way to check that the code has no 
mistake. 

2. For T < +°o using the same potential allows us to study the numerical preci- 
sion of the estimation in the realistic situation of finite size data set, providing 
quantitative estimations about the truncation effects to be expected. 

3. Using different potentials between simulated data generation and the parameters 
value estimation allows us to study numerically to which extends we can only 
correctly estimate the parameter's values, even if huge state vectors are involved. 
Quantitative errors are obtained. We can also perform comparison between dif- 
ferent statistical models, as detailed in the sequel. 



4.1.2 An illustrative example to understand what the algorithm calculates 

Let us start with very simple example, for which we can make explicit what the al- 
gorithm calculates, thus helping the reader to understand in details what the output 
is. 

We consider a situation where the number L of parameters A/ is known (only the 
values of the A/'s are unknown). We start from rather basic examples and then in- 
crease their complexity. In the first examples analytical expression for the topological 
pressure, entropy, RPF eigen-vectors and invariant measure are available. Thus we 
can check that we re-obtain, from the estimation method, the related values up to the 
numerical imprecision. 

One neuron and range-2. Here \j/{(o) = Ai coo(O) +A2COo(0)coo(l)- We obtain ana- 
lytically: 

(b(v/) = (l,.v(v/)-l,A,B(5(vA)-l),) 
b(v/)) = {s{yi/)~B,s{w)-B,l,lf , 

h[n^] = log(.(v/))-A,^-A2^ 

^ A+fi(.y(y)-l) 
s(V/)^+A-B' 

with A = e^i = e'^'i",/? = e^'^^^ = e'''" and where T denotes the transpose. We remind 
that the index vector encodes spike blocs by eq. ( fTOb . Thus, the first index (0) corre- 
sponds to the bloc 00, 1 to 01, 2 to 10 and 3 to 11. r is the firing rate, C the probability 
that the neuron fires two successive time steps. This is one among the few models for 
which a closed-form solution is available. 
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The following numerical verifications have been conducted. A simulated prefix- 
tree whose nodes and values has been generated using (|6]l with Ai — log(2),A2 = 
log(2)/2. We have run the estimation program of A,'s and have obtained the right 
values with a precision better than 10^^. We also obtain a precision better than 10^^ 
for s(i/A),r,C,/i . This first test simply states that the code has no mistake. 

A step further, we have used this simple potential to investigate to which extends 
we can detect if the model is of range- 1 (i.e. with = 0) or range-2 (i.e. with a non- 
negligible value of hi). To this purpose, we have generated a range-2 potential and 
have performed its estimation using a range- 1 and a range-2 potential, comparing the 
entropy difference (Fig. l4.1.2l i. 

As expected the difference is zero for a range-2 model when = 0, and this dif- 
ference increases with Xi- Less obvious is the fact that curves saturate for high values 
of Xi- Indeed, because of the exponential function, high values of A yield huge or 
vanishing values of the RPF operator, thus numerical instabilities. This instability is 
detected by our algorithm. Note that values of A larger than 10 in absolute value have 
Uttle sense from a statistical analysis of spike trains perspective. 



h 




a 5 10 15 



Fi gure 1 : Entropy difference, using h, defined in {29), between tlie estimations of a range- 
1 and a range-2 model. Tlie range-2 model writes (j) = — Aj cao(O) — A2 cuo (0) (Bq ( 1 ) for Xi = 
{— 1 (black),— 0.5 {brown),— 0.2 {red),—0.l (orange), (green), I (blue), 2 (Magenta)}. A2 is a free pa- 
rameter, in abscissa of this curve. The range- 1 corresponds to A2 = 0. 

We also have generated a range- 1 potential and have performed its estimation, using 
a range- 1 versus a range-2 model, and found always that using range-2 model is as good 
as using a model of range- 1 (not shown). 

Two neurons and range-2 (Ising). Here ~Xi a)i (0) + A2 0)2(0)+ A3 Oi (0)0)2(0). 
The largest eigenvalue of the RPF operator isZ = i(v/) = A + B + C + D, with A = 
\,B = e^\C = e^^,D = e^'^^^^^^ and the topological pressure is logi(v/). Here the 
Gibbs distribution has the classical form. We still obtain numerical precision better 
than 10""*, for standard values of A, e.g., Ai = 1, A2 = log(2), A3 = log(2)/2. 
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Two neurons and pattern of spikes. A step further, we have considered i^(fi)) = 

Ai 0)1 (0)+A2O>2(0)+A3«i (0) 0)2(1) «i (2), and v/(o))=AiO)i (0)+A2 0^2(0)+ A3 0)i (0)0)2(1)0)2(2)0)3(3), 

for random values drawn in ] — 1,0[, i.e., considering the statistical identification of 

spike patterns. We still obtain numerical precision better than 10^^, for these standard 

values of A, though the precision decreases with the number of degrees of freedom, as 

expected, while it increases with the observation time. This is investigated in details in 

the remainder of this section. 

When considering larger neuron and lange-R the main obstacle toward analytical 
results is the Galois theorem which prevent a general method for the determination 
of the largest eigenvalue of the RPF operator Therefore, we only provide numerical 
results obtained for more general potentials. 

4.1.3 Gibbs potential precision paradigm: several neurons and various ranges. 

In order to evaluate the numerical precision of the method, we have run the previous 
benchmark considering potentials with all monomial of degree less or equal to 1, and 
less or equal to 2, at a various ranges, with various numbers of neurons. Here we 
have chosen T — +°o and used the same potential for the prefix-tree generation and 
for the parameters value estimation. The computation time is reported in Table [T] and 
the numerical precision in Table |2l for NR < 16. This benchmark allows us to verify 
that there is no "surprise" at the implementation level: computation time increases in a 
supra-Unear way with the potential size, but, thanks to the chosen estimation method, 
remains tractable in the size range compatible with available memory size. This is 
the best we can expect, considering the intrinsic numerical complexity of the method. 
Similarly, we observe that while the numerical precision decreases when considering 
large size potential, the method remains stable. Here tests has been conducted using the 
standard 64-bits arithmetic, while the present implementation can easily be recompiled 
using higher numerical resolution (e.g. "long double") if required. 

A step further, this benchmark has been used to explore the different variants of 
the estimation method discussed in the previous section (avoiding some RPF eigen- 
element's calculation, using other approximations of the KL-divergence criterion, ..) 
and fix the details of the proposed method. 
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Table 1; Cpu-time order of magnitude in second (using Pentium M 750 1.86 GHz, 512Mo of mem- 
ory), for the estimation of a potential with all monomial of degree less or equal to 1 , for i//, and less or equal 
to 2, for v/2, (i.e., 1^, (o) = ifjo' A,CB,(0), ^i/^ia) = ifjo' ka,{0) + '^^0^ I.'j:!ol.lL-T,Kra>mo>j{t)) at 
a range-i? = 2T, + 1 with N neurons. We clearly observe the exponential increase of the computation time. 
Note that the present implementation is not bounded by the computation time, but simply by the exponential 
increase of the memory size. 



Vi 


R=l 


R=2 


R=4 


R=8 


R=16 


Vi 


R=l 


R=2 


R=4 


R=8 


R=16 


N=l 


2.0e-06 


3.0e-06 


8.0e-06 


7.8e-05 


2.9e-01 


N=l 


4.5e-16 


4.0e-06 


4.0e-06 


7.2e-04 


3.7e-02 


N=2 


4.0e-06 


l.Oe-06 


3.0e-05 


6.7e-02 




N=2 


3.0e-06 


5.0e-06 


4.0e-04 


l.le+00 




N=4 


1.3e-05 


3.8e-05 


8.3e-02 






N=4 


1.9e-05 


1.2e-03 


3.6e+00 






N=8 


2.4e-03 


3.2e-01 








N=8 


6.6e-03 


6.2e-01 









Table 2: Numerical precision of the method using synthetic data, for the estimation of V', and 1/7, at a 
range-/? with N neurons. The Euclidean distance | A — A | between the estimated parameter's value A and the 
true parameter's value A is reported here, when the A;'s are randomly drawn in [—1,1]. We clearly observe 
the error increase, but the method remaining numerically stable. 
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4.2 More general tests: applying the method 
4.2.1 Test framework. 

In order to test more general potentials for N = 2 neurons we explicit here the forms 
O, (O, (ig, where keN: 

Ising : v/(a)) = Ai toi (0) + A2 0)2 (0) + A3 0)1 (0) a>2 (0) . 



(■=*: 




PTD - k : \ir{co) = ^ A/ 0)1(0)0)2 (/)■ 



i=-k 

test 1 (estimation precision). Given a selected potential of form (l37b we choose 
randomly its coefficients A/ from an uniform distribution on [—2,0] and we generate 
a spike-train of length T = 4 x 10^. Then we construct a prefix-tree from a sample 
of length Tq (typically To — 10^) taken from the generated spike-train. For each 
sample of length Tq we propose a randomly chosen set of "initial guess" coefficients, 
used to start the estimation method, distributed according to A/"' = A/(l + {U[0, 1] — 
0.5).Ji:/100), where x is the initial percentage of bias from the original set of generat- 
ing coefficients and U[0, 1] is a uniform random variable on [0, 1]. Call A/ the values 
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Figure 2; Mean error (in percentage) vs To size. 



obtained after convergence of the algorithm. Results show that: 

(i) the error E {\Xi — increases with the range of the potential and it decreases 
with To; 

(ii) the error is independent of the initial bias percentage (see figs 14.2. it : 

(iii) /i [ y/] = P [ i/] — TT ( ^) ( y/) is fairly constant with respect to the length Tq (not shown) . 



Test 2 (Models comparison). We select a potential form x^r from those proposed in 
&7\ : we choose randomly its coefficients A/ from an uniform distribution in [—2,0]; 
we generate a spike-train of length T — 1 ■ 10^ and we construct the prefix-tree with 
the spike-train obtained. Using this prefix-tree we estimate the coefficients X^'" that 
minimizes the KL divergence for several statistical models proposed in ( l37T i. The 
coefficients X^'" and h = P[w,n\ ^ ^'^^ iVm) averaged over 20 samples and error 
bars are computed. Results show that : 

(i) The 'best' statistical models (i.e the ones with lowest mean value KL divergence) 
have the same monomials as the statistical model that generated the spike-train, 
plus, possibly additional monomials. For example, in ( l37b . RPTD-1 contains 
Ising, and also the PTD-1 but not PTD-2. We choose the model with the minimal 
number of coefficients in agreement with section |2A4] 
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(ii) The value of the additional coefficients of an over-esimated model (corresponding 
to monomials absent in the corresponding potential) are almost null up to the 
numerical error 

(iii) For all the 'best' suited statistical models (in the sense of (i)), the criterion h [y/] 
( [29] ) averaged over trials, is fairly equal for these models up to a difference of 
order 5 ~ 10^^, and the difference with respect to other types of statistical models 
is at least of 4 orders of magnitude lower We recall that, according to section 
12.4.51 the deviation probability is of order to exp(— 5r). After estimation from a 
raster generated with an Ising model, the ratio of the deviation probabilities ( l33T l 
between an Ising and a RPTD-1 model is ^ T] = exp(0.00001 15 x 10**) , while 
between the Ising and the PTD-3 77 = exp(0.00072 x 10^) meaning that the 
PTD-3 provide a worst estimation. 

(iv) The predicted probability of words corresponds very well with the empirical 
value. 

In order to extend the model comparison we introduce the following notations: let 
w be a word (encoding a spiking pattern) of length R, Pest{w) its mean probability over 
trials calculated with the estimated potential, Pemp{w) its mean empirical average over 
trials (i.e average of form ( |24] | including a time average 71*^^' and a sample average, 
where the samples are contiguous pieces of the raster of length Tq ^ T), and C7e,„p(w) 
the standard deviation of Pemp{w). We now describe the comparison methods. 

We first use the box-plot method |25 1 which is intended to graphically depict groups 
of numerical data through their 'five-number summaries' namely: the smallest obser- 
vation (sample minimum), lower quartile (Ql), median (Q2), upper quartile (Q3), and 
largest observation (sample maximumj^- Figure 14.2.11 shows, in log-scale, the box- 
plot for the distribution of the quantity defined as: 

e(w) = I (Pestiw) -Pempiw)) /(T.mp(w)| (38) 

that is taken as a weighted measure of the deviations. We have considered this distri- 
bution when it takes into account, either all the words up to a given size R,„ax, or only 
the words of that given size. There is no visual difference for R,ru,x — 7. The results 
shows that only models containing the generating potential have the lower deviations 
value with very similar box. On the other hand a "bad" statistical model shows a much 
more extended error distribution . 

Finally a estimation is computed as = -jr; — ^——r Lh, e(w)^ where e(w) is given 
by (I38l l. Values are reported in tables |3| using all words or only those of size Rnuix- 
Since the number of words is high, it is clear that the lower the error, the lower the 

The largest (smallest) observation is obtained using parameter dependent bounds, or "fences", to filter 
aberrant uninteresting deviations. Call = Q^ — Ql and let k denote the parameter value, usually between 
1 .0 and 2.0. Then the bound correspond to Qi + kfi for the largest observation (and for the smallest one to 
21 — kfi). A point X found above (below) is called "mild-outlier" if Qi + k < x < Qi + 2k[i (respectively, 
Ql — 2kp < X < Qi — kj}) OT extreme outlier if jO Q3 + 2k ji (respectively, x <Ql — 2kfi). We have used a 
fence coefficient k = 2.0 to look for outliers. 
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Figure 3; Figure 1 (top left) Expected probability pi^, versus empirical probability n}^ Figure 2 (top 

(T) 

right) to 8 (bottom right) Predicted probability versus empirical probability TT^ (w) for several models. The 
generating potential is a RPTD-2. 
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Figure 5: The box-plot (in log-scale) of the distributions of weighted deviations of word's probability 
versus their empirical probability, for several statistical models, using a generating potential of the form 
(left) RPTD-2 and (right) PTD-3. Midliers Outliers (see footnote [25\ are shown by red dots and extreme 
outliers by green dots. 



x'^ estimated value. Note that x'^ test assumes Gaussian fluctuations about the mean 
value, which are satisfied for finite-range Gibbs distributions, as can be easily seen by 
expanding the large deviations function // in (|32] | up to the second order in e. However, 
when comparing two different Gibbs distributions it might be that the deviations from 
the expected value of one Gibbs distribution compared to the expected value of the 
other Gibbs distribution is well beyond the mean-square deviation of the Gaussian 
fluctuations distribution, giving rise to huge x^ coefficients, as we see in the tables [3] 

4.3 Spike train statistics in a simulated Neural Network 

Here we simulate an Integrate-and-Fire neural network whose spike train statistics is 
explicitely and rigorously known LIU while effects of synaptic plasticity on statistics 



Table 3: coefficient calculated: (left) with all words of size < 7; (right) with words of size 7 only. See 
text for details. 
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have been studied in lfT2l . 



4.3.1 Network dynamics. 

The model is defined as follows. Denote by V, the membrane potential of neuron ; 
and Wij the synaptic weight of neuron j over neuron /, I^ -^' an external input on neuron 
i. Each neuron is submitted to noise, modeled by an additional input, OBBi{t), with 
> and where the Bi{tys are Gaussian, independent, centered random wariable 
with variance 1. The network dynamics is given by: 

N 

Vi{t + 1) = yVi{l -Z[V,{t)]) + ^ WijZ[Vjit)]+lf + OBBiit); i^l...N, (39) 

where ye [0, 1 [ is the leak in this discrete time model (7=1 — y)- Finally, the func- 
tion Z{x) mimics a spike: Z(x) = lifx>0 = l and otherwise, where 9 is the 
firing threshold. As a consequence, equation ( [39] l implements both the integrate and 
firing regime. It turns out that this time-discretisation of the standard integrate-and-Fire 
neuron model, which as discussed in e.g. ||33l , provides a rough but realistic approxi- 
mation of biological neurons behaviors. Its dynamics has been fully characterized for 
dfi = in 1 10 1 while the dynamics with noise is investigated in HI I. Its links to more 
elaborated models closer to biology is discussed in ifTSl . 



4.3.2 Exact spike trains statistics. 

For > there is a unique Gibbs distribution in this model, whose potential is ex- 
plicitely known. It is given by: 



!=1 



com log ( M ^^^^) )+il-com)iog(l-n^'- ^'^^ 



Giico) ) ) V V 

(40) 

where %[x) = -±= f+'^ 'r du, O) = COlL C,(«) = lj=, + /f^^^^y^, 

= LUr.ia) y^'fOji^)' <^fi^) = 1-7"'^'+^''-" ^ Finally, T,(«) is the last time, 
before t ~ — I, where neuron / has fired, in the sequence © (with the convention that 
T,(w) = — oo for the sequences such that (Oi{n) — 0,V« < 0). This potential has infinite 
range but range R>1 approximations exist, that consist of replacing (O = (OZL by 0)1^ 
in ( l40b . The KL divergence between the Gibbs measure of the approximated poten- 
tial and the exact measure decays like 7'^. Finite range potentials admit a polynomial 
expansion of form (|5]i. 



4.3.3 Numerical estimation of spike train statistics 

Here we have considered only one example of model d39] l (more extended simulations 
and results will be provided elsewhere). It consists of 4 neurons, with a sparse con- 
nectivity matrix so that there are neurons without synaptic interactions. The synaptic 
weigths matrix is: 
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while 7= 0.1,(Jb = 0.25, /f = 0.5. 

First, one can compute directly the theoretical entropy of the model using the results 
exposed in the previous section: the entropy of the range-/? approximation, that can be 
computed with our formalism, converges exponentially fast with R to the entropy of 
the infinite range potential. For these parameters, the asymptotic value is h = 0.57. 

Then, we generate a raster of length T — 10^ for the 4 neurons and we compute the 
KL divergence between the empirical measure and several potentials including: 

• (i) The range-/? approximation of ( |40|) . denoted 0^^'. Note that 0^*' does not 
contain all monomials. In particular, it does not have the Ising term ( the corre- 
sponding coefficient is zero). 

• (ii) A Bernoulli model (j)^"'; 

• (iii) An Ising model 

• (iv) A one-time step Ising Markov model (as proposed in Il44l ') ^^^edf . 

• (v) A range-/? model containing all monomials 0"". 

Here we can compute the KL divergence since we known the theoretical entropy. 
The results are presented in the table dUl. Note that the estimated KL divergence of 
range- 1 potentials slightly depend on /? since the RPF operator, and thus the pressure, 
depend on /?. 



Table 4: Kullback-Leibler divergence between the empirical measure of a raster generated by )39t (See 
text for the parameters value) and the Gibbs distribution, for several statistical models. 



^{R) ^Ber ^{s ^MEDF ^all 

R=l 0.379 0.379 0.312 1.211 0.309 

R=2 0.00883 0.299871 0.256671 0.257068 0.0075 

R=3 -0.001 0.250736 0.215422 0.200534 0.0001 



We observe that our procedure recovers the fact that the range-/? potential is 
the best to approximate the empirical measure, in the sense that it minimizes the KL 
divergence and that it has the minimal number of terms (^"" does as good as ^^^^ for 
the KL divergence but it contains more monomials whose coefficient (almost) vanish 
in the estimation). 

-*or equivalently, a RPTD-1 from (37) 
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4.3.4 Synaptic plasticity. 

Here the neural network with dynamics given by ( [39] l has been submitted to the STDP 
rule ( [34l i. The goal is to check the validity of the statistical model given by (|35T l. 
predicted in |fT2|. We use spike-trains of length T — lO'' from a simulated network 
with = 10 neurons. 

Previous numerical explorations of the noiseless case, Gb = 0, have shown ifTOlfTSl 
that a network of such neurons, with fully connected graph, where synapses are 
taken randomly from a distribution ^(0, ^), where C is a control parameter, exhibits 
generic ally a dynamics with very large periods in determined regions of the parameters- 
space (7, C). On this basis, we choose; = 10, 7 = 0.995, C = 0.2. The external 
current l'^''^' in eq. ( |39] | is given by = 0.01 while <7b = 0.01. Note that fixing a 
sufficiently large average value for this current avoids a situation where neurons stops 
firing after a certain time ("neural death"). 

We register the activity after 4000 steps of adaptation with the STPD rule proposed 
in ( |34] |. In this context we expect the potential for the whole population to be of the 
form (35[ and for a subset of the population of the form ( [36] l. Therefore, we choose 
randomly 2 neurons among the and we construct from them the prefix-tree. Then, for 
the 2 neuron potentials forms from ( |37] i, we estimate the coefficients that minimizes 
the Kullback-Leibler divergence. The probability of words of different sizes predicted 
by several statistical models from (IJTT i versus empirical probability (w) obtained 
from a spike train and the corresponding h value of the estimation process for a fixed 
pair of neurons are shown on figure (14.3.41 ). 

Results depicted on figure ( |4.3.4| i show, on one hand, that the statistics is well fitted 
by ( [36] l. Moreover, the best statistical models, are those including rate terms (the 
differences between their KL value is two orders of magnitude smaller that within 
those not disposing of rate terms). We also note that for the words with the smallest 
probability values, the potential do not yields a perfect matching due to finite size 
effects (see fig ( I4.3.4l i). Especially, the small number of events due to low firing rates 
of neurons makes more sensitive the relation between the length of observed sequences 
(word size) and the spike-train length necessary to provide a good sampUng and hence 
a rehable empirical probability. 

4.3.5 Additional tests: the non-stationary case 

Here we present results of the parameter estimation method applied to a spike train 
with statistics governed by a non-stationary statistical model of range 1, i.e. with time 
varying coefficients for rate or synchronization terms. Since the generation of spike- 
trains corresponding to more general higher time-order non-stationary process is not 
trivial, these potentials with higher range values will be analyzed in a forthcoming 
paper 

In the following we use an Ising potential form (|37] | with time-varying coefficients 
V/ = (o)) = Ai (f ) 0)1 (0) +A2(f) 0)2(0) + A3 (f) 0)1 (0) 0)2(0).. The procedure to generate 
a non stationary spike-train of length T is the following. We fix a time dependent 
form for the 3 coefficients A,(f). From the initial value of the A/i (say at time f) we 
compute the invariant measure of the RPF operator. From this, we draw a Chapman- 
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Figure 6: The probability of words of different sizes predicted by several statistical models from 137) 
versus empirical probability Tt^ai^ (w) obtained from a spike train generated by dynamics )39t after 4000 
epochs of adaptation. The h value )29t for each fitting model is shown inside the graphic. The potential is a 
pair potential of the form j36t . Recall that RPTD Models include firing rates but PTD models do not. 
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Kolmogorov equation ( l20b with a time dependent RPF operator computed using the 
next coefficient values A,(f + 1). 

With the generated spike-train, we perform the parameter estimation, but comput- 
ing the empirical average over an small fraction of it which means a time window of 
size Tq = << T . Then, we slide the observation window and estimate again the 
coefficients value. We have verified that estimation procedure can recover correctly the 
coefficient values, for several types of time dependence, provided their variations be 
not too fast, and that the sliding window size be not too large with respect to T . We 
present the reconstruction of the parameters with a sinusoidal time-dependence given 
by Ao(f) =0.4 + 0.3sin(7i^). 




id-US UUi;i in le-Ul) J,Mi U.LL SB 1000 ISOO 2)00 2S0O 3000 3S0O 4000 



Fit [aiicul Priii]4tiililv Fn:]ii:Kl RiiihIi I ly j^jgp 

Fi gure 7: Estimation of coefficients on a Non-Stationary process genated by an Ising model and 
sinousoidal time dependence. Real value(black) and estimated parameter with its eiTor bars (green) com- 
puted over 20 trials. Tlie time siiift is T = 1 , Window size is fixed 1000, but oscillation period corresponds 
to 2000 (left) and 4000 (right). 



5 Discussion and conclusion 
5.1 Comparison with existing methods 

Let us first summarize the advantages and drawbacks of our method compared with the 
existing ones. For this, we list some keywords in the approaches used by the commu- 
nity and discuss the links with our own work. 

• Maximum entropy. The formalism that we use corresponds to a maximum 
entropy method but without limitations on the number or type of contraints. Ac- 
tually, on mathematical grounds, it allows infinitely many constraints. Moreover, 
we do not need to compute the entropy. 
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Markovian approaches. Our method is based on a Markovian approach where 
the memory depth of the Markov chain can be arbitrary long (actually the formal- 
ism that we use allows to theoretically consider processes with infinite memory, 
called chains with complete connections / |4?|/ . see IITTl for an application to spike 
train statistics). As we developed, the link between the potential extracted from 
the maximum entropy principle, by fixing ad hoc observables, and a Markov 
chain is not straightforward, since a potential of this kind is not normalized. 

Monte-Carlo methods. Equation ( |2TI ) allows us to generate spike trains Gibbs- 
distributed with and arbitrary potential (non normalized). The convergence is 
ensured by eq. ( fT4l i. We emphasize that we do not need to assume detailed 
balance. Instead, we impose a technical assumption (primitivity of the Ruelle- 
Perron-Frobenius matrix) which is more general than detailed balance. On the 
opposite, if this assumption does not hold then the unicity of the Gibbs distribu- 
tion is not guarantee and, in this case, the determination of spike train statistics 
from empirical data becomes even more cumbersome. 

Determining an effective synaptic connectivity between neurons. Interac- 
tions between neurons occur via synapses (or gap junction). This interaction 
is not instantaneous, it requires some delay. As a matter of fact, estimating 
the synaptic conductances via the spike statistics requires therefore to consider 
time-dependent potentials. Our formalism allows this. Determining an effective 
synaptic connectivity between neurons from spike trains will be the subject of a 
forthcoming paper 

Boltzmann learning. Our approach can be viewed as "Boltzmann learning" 
(as presented e.g. in 161]) without restrictions on the parameters that we learn, 
without using a Monte Carlo approach (which assumes detailed balance), and 
uses a criterion which is strictly convex. 

Performances. At its current implementation level, the proposed method allows 
us to analyze the statistics of small groups (up to 8/12) of neurons. The para- 
metric statistical potential of Markov processes up to range 16/20 is calculable, 
thus considering up to 2^" states for the process. The implementation considers 
several well-established numerical methods, in order to be applicable to a large 
set of possible data. With respect to the state of the art, this method allows us 
to consider non-trivial statistics (e.g. beyond rate models and even models with 
correlation), thus targeting models with complex spike patterns. This method 
is in a sense the next step after Ising models, known as being able to represent 
a large but limited part of the encoded information (e.g. ||66l [48l ). Another 
very important difference with respect to other current methods is that we per- 
form the explicit variational optimization of a well defined quantity, i.e., the 
KL-divergence between the observed and estimated distributions. The method 
proposed here does not rely on Monte Carlo Markov Chain methods but on a 
spectral computation based on the RPF operator, providing exact formula, while 
the spectral characteristics are easily obtained from standard numerical methods. 
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The main drawback of our method is that it does not allow to treat a large num- 
ber of neurons and simultaneously a large range. This is due to the evident fact 
that the number of monomials combinatorically increases as A^, R growth. How- 
ever, this is not a problem intrinsic to our approach but to parametric estimations 
potentials of the form (|6]l. We believe that other form of potential could be more 
efficient (see fTll for an example). We also want to emphasize that, when con- 
sidering Ising like statistics our algorithm is less performant than the existing 
ones (although improvements in speed and memory capacity thanks to the use 
of parallel computation algorithms remain an open and natural developpement 
path), for the simple reason that the latter has been developed and optimized us- 
ing the tremendous results existing in statistical physics, for spins systems. Their 
extensions to models of the general form ^ seems rather delicate, as suggested 
by the nice work in ll44l where extension between the 1-step Markov case is 
already cumbersome. 

• Mean-field methods. Mean-field methods aim at computing the average value 
of observables ("order parameters") relevant for the characterisation of statisti- 
cal properties of the system. Typical examples are magnetisation in ferromag- 
netic models (corresponding to rates in spiking neurons models), but more elab- 
orated order parameters are known e.g. in spin glasses pTl or in neural net- 
works iTTSl . Those quantities obey equations (usually called mean-field equa- 
tions) which are, in most cases, not explicitely solvable. Therefore, approxima- 
tions are proposed from the simplest (naive mean-field equations) to more com- 
plex estimations, with significant results developed in the realm of spins systems 
(Ising model, Sherrington-Kirckpatrick spin glass model fTTl). Examples are 
the replica method BtII . Thouless- Anderson-Palmer equations ||79ll . the Plefka 
expansion llTTl . or more recently e.g. the Sessak-Monasson approximation llTOl 
(for a recent review on mean-field methods see [52]). Since the seminal paper by 
Schneidman and collaborators ||67l they have also been applied to spike trains 
statistics analysis assuming that neurons dynamics generates a spike statistics 
characterized by a Gibbs distribution with an Ising Hamiltonian. In their most 
common form these methods do not consider dynamics (e.g time correlations) 
and their extension to the time-dependent case (e.g. dynamic mean-field meth- 
ods) is far from being straightforward (see e.g. ||76ll75l [3ll65ll24 | for examples of 
such developments). Moreover, exact mean-field equations and their approxima- 
tions usually only provide a probability measure at positive distance to the true 
(stationary) probability measure of the system (this distance can be quantified in 
the setting of information geometry using e.g. the KL distance 121). This is the 
case whenever the knowledge of the sought order parameters is not sufficient to 
determine the underlying probability. 

The present work can, in some sense, be interpreted in the realm of mean-field 
approaches. Indeed, we are seeking an hidden Gibbs measure and we have only 
information about the average value of ad hoc observables. Thus, equation ( fTTI l 
is a mean-field equation since it provides the average value of an observable with 
respect to the Gibbs distribution. There are therefore L such equations, where L 
is the number of monomials in the potential Are all these equations relevant 
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? If not, which one are sufficient to determine univoquely the Gibbs distribu- 
tion ? Which are the order parameters ? The method consisting of providing a 
hierarchy of mean-field approximations which starts with the BernoulH model 
(all monomials but the rate terms are replaced by a constant), then Ising (all 
monomials but rate and spatial correlations are replaced by a constant), while 
progressively diminishing the KL divergence allows to answer the question of 
the relevant order parameters and can be interpreted as well in the realm of in- 
formation geometry. This hierarchical approach is a strategy to cope with the 
problem of combinatorial explosion of terms in the potential when the number 
of neurons or range increases. But the form of potential that we consider does 
not allow a straightforward application of the methods inherited from statistical 
mechanics of spin systems. As a consequence, we believe that instead of focus- 
ing too much on these methods it should be useful to adopt technics based on 
large deviations (which actually allows the rigorous fundation of dynamic mean 
field methods for spin-glasses [3J and neural networks ||65ll24l ). This is what the 
present formalism offers. 

5.2 Conclusion and perspectives 

The thermodynamic formalism allows us to provide closed-form calculations of inter- 
esting parameters related to spectral properties of the RPF operator. We, for instance, 
propose an indirect estimation of the entropy, via an explicit formula. We also provide 
numbers for the average values of the related observable, probability measure, etc.. 
This means that as soon as we obtain the numerical values of the Gibbs distribution 
up to some numerical precision, all other statistical parameters come for free without 
additional approximations. 

A step further, the non-trivial but very precious virtue of the method is that it allows 
us to efficiently compare models. We thus not only estimate the optimal parameters of 
a model, but can also determine among a set of models which model is the most rele- 
vant. This means, for instance, that we can determine if either only rates, or rates and 
correlations matters, for a given piece of data. Another example is to detect if a given 
spike pattern is significant, with respect to a model not taking this pattern into account. 
The statistical significance mechanism provides numbers that are clearly different for 
models corresponding or not to a given empirical distribution, providing also an ab- 
solute test about the estimation significance. These elements push the state of the art 
regarding statistical analysis of spike train a step further. 

At the present state of the art, the present method is limited by three bounds. 

First of all, the formalism is developed for a stationary spike-train, i.e. for which 
the statistical parameters are constant. This is indeed a strong limitation, especially in 
order to analyze biological data, though several related approaches consider the same 
restrictive framework. This drawback is overcome at two levels. At the implementation 
level we show here how using a sliding estimation window and assuming an adiabatic, 
i.e. slowly varying, distribution we still can perform some relevant estimation. In a 
nutshell, the method seems still usable and we are now currently investigating this on 
both simulated and biological data, this being another study on its own. At a more 
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theoretical level, we are revisiting the thermodynamic formalism developed here for 
time varying parameters (in a similar way as the so called inhomogeneous Poisson 
process with time varying rates). Though this yields non-trivial developments beyond 
the scope of this work, it seems that we can generalize the present formalism in this 
direction. 

Secondly, the present implementation has been optimized for dense statistical dis- 
tributions, i.e., in the case where almost all possible spike combinations are observed. 
Several mechanisms, such as look-up tables, make this implementation very fast. How- 
ever, if the data is sparse, as it may be the case for biological, a dual implementation 
has to be provided using data structure, such as associative tables, well adapted to the 
fact that only a small amount of possible spike combinations are observed. This com- 
plementary implementation has been made available and validated against the present 
one. This is going to analyze sparse Markov processes up to range much higher than 
16/20. Again this is not a trivial subject and this aspect must be developed in a next 
study as well as the applicability of parallel computing alternatives ( e.g. sparse matrix 
storage, parallel fast-eigenvalue algorithms, etc.). 

Finally, given an assembly of neurons, every statistical tools available today pro- 
vide only the analysis of the statistics a small subset of neurons, and it is known that this 
only partially reflects the behavior of the whole population |40|. The present method 
for instance, is difficult to generalize to more than 8/10 neurons because of the incom- 
pressible algorithmic complexity of the formalism although parallel computation tech- 
niques might be helpful. However, the barrier is not at the implementation level, but 
at the theoretical level, since effective statistical general models (beyond Ising models) 
allow for instance to analyze statistically large spiking patterns such as those observed 
in synfire chains |30| or polychronism mechanisms |54|. This may be the limit of the 
present class of approaches, and things are to be thinked differently. We believe that 
the framework of thermodynamic formalism and links to Statistical Physics is still a 
relevant source of methods for such challenging perspectives. 
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